Re:

2018-08-10 Thread Curtis Ruck
I created NIFI-5506 for the wantClientAuth-specific issue, and submitted a
WIP PR#2944 for review.

Besides issues with getting OIDC working (on the OIDC server side), this
enables external providers.  Potentially, this could be amended to include
X509 through a reverse proxy by way of a request header, but since that
wouldn't work with a reverse proxy without this PR anyway, I considered it
out of scope for my near-term issue.
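
For context on what such a change might look like: below is a minimal
sketch of a property-gated wantClientAuth toggle, in Java against Jetty's
SslContextFactory (which appears to be what the contextFactory snippet
quoted later in this thread refers to).  The property key here is
illustrative only, not necessarily what NIFI-5506/PR#2944 actually uses.

import java.util.Properties;
import org.eclipse.jetty.util.ssl.SslContextFactory;

public class ClientAuthSketch {
    // Hypothetical property key, for illustration only.
    static final String WANT_CLIENT_AUTH = "nifi.security.want.client.auth";

    static void configureClientAuth(SslContextFactory contextFactory,
                                    Properties niFiProperties,
                                    boolean clientAuthRequiredForRestApi) {
        if (clientAuthRequiredForRestApi) {
            // "Certificates only": the handshake fails without a client cert.
            contextFactory.setNeedClientAuth(true);
        } else {
            contextFactory.setNeedClientAuth(false);
            // Defaulting to true preserves current behavior ("certificates or
            // other provider"); setting it false yields "other provider only".
            final boolean want = Boolean.parseBoolean(
                    niFiProperties.getProperty(WANT_CLIENT_AUTH, "true"));
            contextFactory.setWantClientAuth(want);
        }
    }
}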

--
Curtis Ruck


On Thu, Aug 9, 2018 at 3:47 PM Curtis Ruck  wrote:

> The issue with Reverse Proxies and "certificates or other provider" is
> that if want=true, then the reverse proxy provides its certificate, if a
> machine certificate is configured.  In Apache HTTPD, this machine
> certificate can be set at the Server or VHost level, not on individual
> proxy rules, so to remove it for NiFi, I have to remove it for our other
> apps that "require" X509 client auth, and then either do an SSO workflow
> or consume Reverse Proxy-provided "authentication" details.  This means
> "certificates or other provider" ends up as "reverse proxy certificate
> only".  So if Bob and Tim visit the reverse proxy, NiFi believes they are
> "Reverse Proxy", not Bob or Tim.
>
> Ideally I need "other provider", because my "other provider" does PKI
> authentication as part of SSO.  I could use "certificates or other
> provider" if NiFi could recognize Reverse Proxy-validated certificates
> passed in via a request header.  If the reverse proxy doesn't provide a
> certificate, then it uses "other provider".  JBoss and Tomcat both
> provide this functionality.
>
> I've already made the "else { setWantClientAuth(false); }" change in a
> fork, and it got me closer to my customer's end goals.  I'm trying to get
> this change into NiFi so we don't have to maintain that fork.  I believe
> implementing this with a default of want=true would not break any existing
> users of NiFi, and it would allow better integration with Reverse Proxies
> and Single Sign-On.
>
> So in the near term I'd like a new nifi.properties setting to disable
> wantClientAuth, with it enabled by default.
> In the long term it would be ideal to support external authn/z providers
> as first-class citizens.
>
> --
> My perspective comes from over a decade of implementing Single Sign-On in
> applications that don't always support it, for roughly 100 applications all
> sitting behind reverse proxies, providing true single sign-on without users
> having to follow any special instructions to authenticate.  I'm a true
> believer that the best security is security that doesn't impact the users,
> and proper single sign-on allows application developers to focus on their
> application's logic rather than their AuthN/AuthZ security model.
>
> --
> Curtis Ruck
>
>
>
> On Thu, Aug 9, 2018 at 3:00 PM Andy LoPresto  wrote:
>
>> I think we agree in our assessment of what the code is doing and disagree
>> in our desire for how that should occur. If OIDC is enabled and
>> isClientAuthRequiredForRestApi() returns false, the result is:
>>
>> // Functionally equivalent to contextFactory.setNeedClientAuth(false);
>> contextFactory.setWantClientAuth(true);
>>
>> That means that the server will request a client certificate if
>> available, but will not require its presence to negotiate the TLS
>> handshake. You are asking to set contextFactory.setWantClientAuth(false);
>> as well, which will suppress the certificate selection dialog. If
>> needClientAuth and wantClientAuth are both false, client certificates
>> cannot be used to authenticate as they will never be sent from the browser.
>> This will effectively allow you to choose between “certificates only”,
>> “certificates or other provider”, and “other provider only (no
>> certificates)”.
>>
>> I am saying that core NiFi *always* accepts client certificates as an
>> authentication mechanism; there is no scenario in which need and want are
>> both set to false. This is by design. Again, I am not saying this can never
>> change, but because of the expectations, documentation, and shared
>> knowledge around this mechanism, changing it is (in my opinion) a major
>> change, and should not be done in a minor release. Other project members
>> may (and probably do) disagree with me.
>>
>> A property in nifi.properties which defaults to “off” but when manually
>> enabled can bypass this requirement is an option. I don’t think we disagree
>> on how to implement this specific change; I think we differ only on how
>> impactful it will be. My perspective comes from supporting a large number
>> of users with a broad variety of (often conflicting) requirements, and
>> sometimes (both they and I have) very little knowledge of the rest of their
>> IT ecosystem. I believe your perspective comes from a specific user with
>> specific requirements. That’s why I recommended making the localized change
>> you need in a fork of the project, so you can achieve your objective in a
>> timeframe that is not blocked by other parties.
>>
>> Andy LoPresto
>> alopre...@apache.org
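
For readers following the want/need distinction Andy describes: it maps
directly onto the standard JSSE flags, where setting one mode overrides the
other.  A minimal server-side sketch of the three resulting modes (the port
number and class name are illustrative):

import javax.net.ssl.SSLServerSocket;
import javax.net.ssl.SSLServerSocketFactory;

public class ClientAuthModes {
    public static void main(String[] args) throws Exception {
        SSLServerSocket server = (SSLServerSocket) SSLServerSocketFactory
                .getDefault().createServerSocket(8443);
        // "Certificates only": handshake is rejected without a client cert.
        server.setNeedClientAuth(true);
        // "Certificates or other provider": cert requested but optional, so
        // browsers with a matching cert will show the selection dialog.
        server.setNeedClientAuth(false);
        server.setWantClientAuth(true);
        // "Other provider only": cert never requested, so never sent.
        server.setWantClientAuth(false);
    }
}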

Re: NiFi 1.6.0 cluster stability with Site-to-Site

2018-08-10 Thread Joe Gresock
Any NiFi developers on this list have any suggestions?

On Wed, Aug 8, 2018 at 7:38 AM Joe Gresock  wrote:

> I am running a 7-node NiFi 1.6.0 cluster that performs fairly well when
> it's simply processing its own data (putting records in Elasticsearch,
> MongoDB, running transforms, etc.).  However, when we add receiving
> Site-to-Site traffic to the mix, the CPU spikes to the point that the nodes
> can't talk to each other, resulting in the inability to view or modify the
> flow in the console.
>
> I have tried some basic things to mitigate this:
> - Requested that the sending party use a comma-separated list of all 7 of
> our nodes in their Remote Process Group that points to our cluster, in
> hopes that it will help balance the requests
> - Requested that the sending party use some of the batching settings on
> the Remote Port (i.e., Count = 20, Size = 100 MB, Duration = 10 sec)
> - Reduced the thread count on our Input Port to 2
>
> Are there any known nifi.properties that can be set to help mitigate this
> problem?  Again, it only seems to be a problem when we are both receiving
> site-to-site traffic and doing our normal processing, but taking each of
> those activities in isolation seems to be okay.
>
> Thanks,
> Joe
>


-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength. -Philippians 4:12-13


Re:

2018-08-10 Thread Joe Witt
Curtis

Now that there is also a PR for this, I'll comment directly there on
the specifics of the PR as well.

In reviewing the discussion here..

There is consensus that enabling the pattern of REST API interaction
you need for your case is a valuable capability.

However, we have not achieved consensus on how best to address it.

And in reviewing the PR and considering its impacts more broadly, there
are concerns that it breaks the model necessary for site-to-site and
cluster request replication to work properly.  These are designed to
expect certificates to be present, and while proxying mechanisms are
supported through various headers, the style you'd need in your model
would not be supported.

I don't believe your earlier statement that this change doesn't break
anything can be supported at this time.  I'll comment on the specifics
of that concern on the PR so we keep that discussion with the PR.

If there is indeed a path to enable your use case while supporting
existing capabilities, then that's great; let's find it.  Your expertise
in these proxies and authentication models, combined with the community's
knowledge of how NiFi works today and how to get it where it needs to be,
is key.

Thanks
Joe

Re: NiFi 1.6.0 cluster stability with Site-to-Site

2018-08-10 Thread Martijn Dekkers
What's the OS you are running on?  What kind of systems?  Memory stats,
network stats, JVM stats, etc.?  How much data is coming through?



Large JSON File Best Practice Question

2018-08-10 Thread Benjamin Janssen
All, I'm seeking some advice on best practices for dealing with FlowFiles
that contain a large volume of JSON records.

My flow works like this:

Receive a FlowFile with millions of JSON records in it.

Potentially filter out some of the records based on the value of the JSON
fields.  (custom processor uses a regex and a json path to produce a
"matched" and "not matched" output path)

Potentially split the FlowFile into multiple FlowFiles based on the value
of one of the JSON fields (custom processor uses a json path and groups
into output FlowFiles based on the value).

Potentially split the FlowFile into uniformly sized smaller chunks to
prevent choking downstream systems on the file size (we use SplitText when
the data is newline delimited; we don't currently have a way when the data
is a JSON array of records).

Strip out some of the JSON fields (using a JoltTransformJSON).

At the end, wrap each JSON record in a proprietary format (custom processor
wraps each JSON record)

This flow is roughly similar across several different unrelated data sets.

The input data files are occasionally provided in a single JSON array and
occasionally as newline delimited JSON records.  In general, we've found
newline delimited JSON records far easier to work with because we can
process them one at a time without loading the entire FlowFile into memory
(which we have to do for the array variant).

However, if we are to use JoltTransformJSON to strip out or modify some of
the JSON contents, it appears to only operate on an array (which is
problematic from the memory footprint standpoint).

We don't really want to break our FlowFiles up into individual JSON records
as the number of FlowFiles the system would have to handle would be orders
of magnitude larger than it is now.

Is our approach of moving towards newline-delimited JSON a good one?  If
so, is there anything that would be recommended for replacing
JoltTransformJSON?  Or should we build a custom processor?  Or is this a
reasonable feature request for the JoltTransformJSON processor to support
newline-delimited JSON?

Or should we be looking into ways to do lazy loading of the JSON arrays in
our custom processors (I have no clue how easy or hard this would be to
do)?  My little bit of googling suggests this would be difficult.
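
For the lazy-loading question: one plausible approach is Jackson's streaming
API, which can walk a top-level array token by token and materialize only
one record at a time.  A minimal sketch (the class and method names here are
illustrative, not an existing NiFi processor):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.InputStream;

public class LazyJsonArrayReader {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Streams a top-level JSON array, holding one record in memory at a time.
    public static void processRecords(InputStream in) throws Exception {
        try (JsonParser parser = new JsonFactory().createParser(in)) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a top-level JSON array");
            }
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                // readTree() consumes exactly one object from the stream.
                JsonNode record = MAPPER.readTree(parser);
                // Filter, split, or wrap the record here, then discard it.
            }
        }
    }
}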


Re: Large JSON File Best Practice Question

2018-08-10 Thread Joe Witt
Ben,

Are you familiar with the record readers, writers, and associated
processors?

I suspect if you make a record writer for your custom format at the end of
the flow chain you'll get great performance and control.

Thanks


Re: NiFi 1.6.0 cluster stability with Site-to-Site

2018-08-10 Thread Joe Witt
Joe G

I do recall there were some fixes and improvements related to clustering
performance and thread pooling as they relate to site-to-site.  I don't
recall precisely which version they went into, but I'd strongly recommend
trying the latest release if you're able.
Thanks


Re: Large JSON File Best Practice Question

2018-08-10 Thread Benjamin Janssen
I am not.  I continued googling for a bit after sending my email and
stumbled upon a slide deck by Bryan Bende.  I think my initial concern
looking at it is that it seems to require schema knowledge.

For most of our data sets, we operate in a space where we have a handful of
guaranteed fields and who knows what other fields the upstream provider is
going to send us.  We want to operate on the data in a manner that is
non-destructive to unanticipated fields.  Is that a blocker for using the
RecordReader stuff?



Re: AVRO is the only output format with ExecuteSQL

2018-08-10 Thread Matt Burgess
Boris et al,

I put up a PR [1] to add ExecuteSQLRecord and QueryDatabaseTableRecord
under NIFI-4517, in case anyone wants to play around with it :)

Regards,
Matt

[1] https://github.com/apache/nifi/pull/2945
On Tue, Aug 7, 2018 at 8:30 PM Boris Tyukin  wrote:
>
> Matt, you rock!! thank you!!
>
> On Tue, Aug 7, 2018 at 5:16 PM Matt Burgess  wrote:
>>
>> Sounds good, it makes the underlying code a bit more complicated but I see 
>> from y’all’s points that a “separate” processor is a better user experience. 
>> I’m knee deep in it as we speak, hope to have a PR up in a few days.
>>
>> Thanks,
>> Matt
>>
>>
>> On Aug 7, 2018, at 5:07 PM, Andrew Grande  wrote:
>>
>> I'd really like to see the Record suffix on the processor for 
>> discoverability, as already mentioned.
>>
>> Andrew
>>
>> On Tue, Aug 7, 2018, 2:16 PM Matt Burgess  wrote:
>>>
>>> Yeah that's definitely doable, most of the logic for writing a
>>> ResultSet to a Flow File is localized (currently to JdbcCommon but
>>> also in ResultSetRecordSet), so I wouldn't think it would be too much of
>>> a refactor. What are folks' thoughts on whether to add a Record Writer
>>> property to the existing ExecuteSQL or subclass it to a new processor
>>> called ExecuteSQLRecord? The former is more consistent with how the
>>> SiteToSite reporting tasks work, but this is a processor. The latter
>>> is more consistent with the way we've done other record processors,
>>> and the benefit there is that we don't have to add a bunch of
>>> documentation to fields that will be ignored (such as the Use Avro
>>> Logical Types property which we wouldn't need in a ExecuteSQLRecord).
>>> Having said that, we will want to offer the same options in the Avro
>>> Reader/Writer, but Peter is working on that under NIFI-5405 [1].
>>>
>>> Thanks,
>>> Matt
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-5405
>>>
>>> On Tue, Aug 7, 2018 at 2:06 PM Andy LoPresto  wrote:
>>> >
>>> > Matt,
>>> >
>>> > Would extending the core ExecuteSQL processor with an ExecuteSQLRecord 
>>> > processor also work? I wonder about discoverability if only one processor 
>>> > is present and in other places we explicitly name the processors which 
>>> > handle records as such. If the ExecuteSQL processor handled all the SQL 
>>> > logic, and the ExecuteSQLRecord processor just delegated most of the 
>>> > processing in its #onTrigger() method to super, do you foresee any 
>>> > substantial difficulties? It might require some refactoring of the parent 
>>> > #onTrigger() to service methods.
>>> >
>>> >
>>> > Andy LoPresto
>>> > alopre...@apache.org
>>> > alopresto.apa...@gmail.com
>>> > PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>>> >
>>> > On Aug 7, 2018, at 10:25 AM, Andrew Grande  wrote:
>>> >
>>> > As a side note, one has to have a serious justification _not_ to use
>>> > record-based processors. The benefits, including performance, are too
>>> > numerous to call out here.
>>> >
>>> > Andrew
>>> >
>>> > On Tue, Aug 7, 2018, 1:15 PM Mark Payne  wrote:
>>> >>
>>> >> Boris,
>>> >>
>>> >> Using a Record-based processor does not mean that you need to define a 
>>> >> schema upfront. This is
>>> >> necessary if the source itself cannot provide a schema. However, since 
>>> >> it is pulling structured data
>>> >> and the schema can be inferred from the database, you wouldn't need to. 
>>> >> As Matt was saying, your
>>> >> Record Writer can simply be configured to Inherit Record Schema. It can 
>>> >> then write the schema to
>>> >> the "avro.schema" attribute or you can choose "Do Not Write Schema". 
>>> >> This would still allow the data
>>> >> to be written in JSON, CSV, etc.
>>> >>
>>> >> You could also have the Record Writer choose to write the schema using 
>>> >> the "avro.schema" attribute,
>>> >> as mentioned above, and then have any down-stream processors read the 
>>> >> schema from this attribute.
>>> >> This would allow you to use any record-oriented processors you'd like 
>>> >> without having to define the
>>> >> schema yourself, if you don't want to.
>>> >>
>>> >> Thanks
>>> >> -Mark
>>> >>
>>> >>
>>> >>
>>> >> On Aug 7, 2018, at 12:37 PM, Boris Tyukin  wrote:
>>> >>
>>> >> Thanks for all the responses! It means I am not the only one interested
>>> >> in this topic.
>>> >>
>>> >> A record-aware version would be really nice, but a lot of times I do not
>>> >> want to use record-based processors, since I need to define a schema for
>>> >> input/output upfront and just want to run a SQL query and get whatever
>>> >> results back. It just adds an extra step that can break and has to be
>>> >> supported.
>>> >>
>>> >> Similar to the Kafka processors, it is nice to have the option of a
>>> >> record-based processor vs. a message-oriented processor. But if one
>>> >> processor can do it all, it is even better :)
>>> >>
>>> >>
>>> >> On Tue, Aug 7, 2018 at 9:28 AM Matt Burgess  wrote:
>>> >>>
>>> >>> I'm definitely interested in supporting a record-aware version as well
>>> >>> (I wr

Re: NiFi 1.6.0 cluster stability with Site-to-Site

2018-08-10 Thread Michael Moser
When I read this I thought of NIFI-4598 [1] and this may be what Joe
remembers, too.  If your site-to-site clients are older than 1.5.0, then
maybe this is a factor?

[1] - https://issues.apache.org/jira/browse/NIFI-4598

-- Mike


Re: NiFi 1.6.0 cluster stability with Site-to-Site

2018-08-10 Thread Joe Witt
Yep, what Mike points to is exactly what I was thinking of.  Since you're
on 1.6.0, the issue is probably something else.  1.6 included an updated
Jersey client or something related to that, and its performance was really
bad for our case.  In 1.7.0 it was replaced with an implementation
leveraging OkHttp.  This may be an important factor.

Thanks


Re: NiFi 1.6.0 cluster stability with Site-to-Site

2018-08-10 Thread Mark Payne
Joe G,

Also, to clarify: when you say "when we add receiving Site-to-Site traffic
to the mix, the CPU spikes to the point that the nodes can't talk to each
other, resulting in the inability to view or modify the flow in the
console", what exactly does "when we add receiving Site-to-Site traffic to
the mix" mean?  Does it mean adding an Input Port to your canvas' Root
Group?  Starting the Input Port?  Simply having the sender start
transmitting data?  Creating a Remote Process Group on your canvas?  I'm
trying to understand the exact action that is being taken here.

The reason that I ask is that there was some refactoring of the component
lifecycles in 1.6.0, which caused Funnels that were not fully connected to
start using a huge amount of CPU.  That was addressed in NIFI-5075 [1].
I'm wondering if perhaps you've stumbled across something similar, related
to Root Group Ports or RPG Ports.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-5075




After 1.7.1 upgrade, no Provenance data is visible

2018-08-10 Thread Peter Wicks (pwicks)
After upgrading our NiFi instances to 1.7.1 we are not able to see Provenance
data anymore in the UI. We see this across about a dozen instances.

The UI tells me provenance is available for about the last 24 hours, and I
can see that files have moved in and out of the processor in the last 5
minutes. In the logs, I can see the provenance query run, and it returns 0
results.

Thoughts? I saw a few tickets related to Provenance in 1.7, but not sure if 
they have an impact.

Here are our properties:

# Provenance Repository Properties
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.debug.frequency=1_000_000
nifi.provenance.repository.encryption.key.provider.implementation=
nifi.provenance.repository.encryption.key.provider.location=
nifi.provenance.repository.encryption.key.id=
nifi.provenance.repository.encryption.key=

# Persistent Provenance Repository Properties
nifi.provenance.repository.directory.default=/data/nifi/repositories/provenance_repository
nifi.provenance.repository.max.storage.time=24 hours
nifi.provenance.repository.max.storage.size=1 GB
nifi.provenance.repository.rollover.time=30 secs
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.journal.count=16
# Comma-separated list of fields. Fields that are not indexed will not be
# searchable. Valid fields are:
# EventType, FlowFileUUID, Filename, TransitURI, ProcessorID,
# AlternateIdentifierURI, Relationship, Details
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship
# FlowFile Attributes that should be indexed and made searchable.  Some
# examples to consider are filename, uuid, mime.type
nifi.provenance.repository.indexed.attributes=
# Large values for the shard size will result in more Java heap usage when
# searching the Provenance Repository, but should provide better performance
nifi.provenance.repository.index.shard.size=500 MB
# Indicates the maximum length that a FlowFile attribute can be when
# retrieving a Provenance Event from the repository. If the length of any
# attribute exceeds this value, it will be truncated when the event is
# retrieved.
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
nifi.provenance.repository.warm.cache.frequency=1 hour

Thanks,
  Peter


Re: After 1.7.1 upgrade, no Provenance data is visible

2018-08-10 Thread Michael Moser
Hi Peter,

There was a change to provenance-related access policies in 1.7.0.  Check
out the Migration Guide [1] for 1.7.0.  It talks about what you'll need to
do.

[1] - https://cwiki.apache.org/confluence/display/NIFI/Migration+Guidance

-- Mike

