Re: PutKafka use with large quantity of data?

2019-04-04 Thread Bryan Bende
Each queue has back-pressure settings on it, which default to 10,000 flow
files or 1 GB of flow file content. When one of these thresholds is
exceeded, the preceding processor will not be scheduled until the queue
drops back below the threshold.

Most likely, if GenerateFlowFile has a Run Schedule of 0 seconds, it is
filling up the queue faster than PutKafka can take flow files out of it.
This isn't necessarily wrong; the two will just be in a constant race to
add to and remove from the queue.

You can do one or more of the following, whichever makes sense for your
setup: raise the queue's back-pressure thresholds (if your available JVM
heap allows it), increase the concurrent tasks on the PutKafka processor so
more than one thread is pulling from the queue, or slow GenerateFlowFile
down slightly by using a Run Schedule of 10 ms.
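
If it helps to picture that race, here is a minimal, illustrative Python
sketch (plain threads and a queue, not NiFi internals; the threshold
constant just mirrors the 10k default):

    import queue
    import threading
    import time

    BACK_PRESSURE_OBJECT_THRESHOLD = 10_000  # NiFi's default object threshold

    q = queue.Queue()

    def generate_flow_file():
        # Producer side: effectively not scheduled while back-pressure is
        # engaged, mirroring how NiFi stops running the preceding processor.
        while True:
            if q.qsize() < BACK_PRESSURE_OBJECT_THRESHOLD:
                q.put(b"flow-file-content")
            else:
                time.sleep(0.01)  # wait for the queue to drain below threshold

    def put_kafka():
        # Consumer side: drains the queue as fast as it can "publish".
        while True:
            q.get()
            q.task_done()

    threading.Thread(target=generate_flow_file, daemon=True).start()
    threading.Thread(target=put_kafka, daemon=True).start()
    time.sleep(1)  # let the race run briefly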

On Thu, Apr 4, 2019 at 11:43 AM l vic  wrote:
>
> Actually, it's not the Kafka topic but the NiFi queue between
> "GenerateFlowFile" and "PutKafka" that gets overrun


Re: PutKafka use with large quantity of data?

2019-04-04 Thread l vic
Actually, it's not the Kafka topic but the NiFi queue between
"GenerateFlowFile" and "PutKafka" that gets overrun

On Thu, Apr 4, 2019 at 10:58 AM Joe Witt  wrote:

> Can you share screenshots, logs, and a more detailed description of what
> you're doing, what you're observing with NiFi and the system, and what you
> expect it to be doing?
>
> Thanks


Re: PutKafka use with large quantity of data?

2019-04-04 Thread Bryan Bende
We need to define what "NiFi stops responding" means...

Are there tons of flow files queued up before PublishKafka?
Are there back-pressure indicators on any of the queues?
Do the Kafka-related processors show active threads in the top-right
corners of the processors?
Does NiFi crash?

On Thu, Apr 4, 2019 at 11:28 AM Andrew Grande  wrote:
>
> What's the concurrency setting for these processors? What's the global
> NiFi thread pool size?
>
> I wonder if you might be running out of available threads while they are
> waiting on external system I/O under load.
>
> Andrew


Re: PutKafka use with large quantity of data?

2019-04-04 Thread Andrew Grande
What's the concurrency setting for these processors? What's the global NiFi
thread pool size?

I wonder if you might be running out of available threads while they are
waiting on external system I/O under load.

Andrew

On Thu, Apr 4, 2019, 8:24 AM l vic  wrote:

> What this particular process group does: it writes a large dataset to a
> Kafka topic; one consumer reads from the topic and saves the data to an
> HBase/PQS table, while another consumer writes to an ES index.


Re: PutKafka use with large quantity of data?

2019-04-04 Thread l vic
What this particular process group does: it writes a large dataset to a Kafka
topic; one consumer reads from the topic and saves the data to an HBase/PQS
table, while another consumer writes to an ES index.

On Thu, Apr 4, 2019 at 10:58 AM Joe Witt  wrote:

> Can you share screenshots, logs, and a more detailed description of what
> you're doing, what you're observing with NiFi and the system, and what you
> expect it to be doing?
>
> Thanks


Re: PutKafka use with large quantity of data?

2019-04-04 Thread Joe Witt
Can you share screenshots, logs, and a more detailed description of what
you're doing, what you're observing with NiFi and the system, and what you
expect it to be doing?

Thanks

On Thu, Apr 4, 2019 at 10:56 AM l vic  wrote:

> No, actually what happens is that NiFi stops responding (if I use it
> without rate control)


Re: PutKafka use with large quantity of data?

2019-04-04 Thread l vic
No, actually what happens is that NiFi stops responding (if I use it without
rate control)


On Thu, Apr 4, 2019 at 10:42 AM Joe Witt  wrote:

> Hello
>
> There isn't really a feedback mechanism based on load on the Kafka topic.
> When you say overrunning the topic, do you mean that you don't want there
> to be a large lag between consumers and their current offset, and that if
> it grows you want NiFi to slow down?  I don't believe there is anything
> inherent to the Kafka producer protocol that would inform us of this.  We
> could periodically poll for this information and optionally back off.
>
> Is this what you have in mind?
>
> Thanks


Re: PutKafka use with large quantity of data?

2019-04-04 Thread Joe Witt
Hello

There isn't really a feedback mechanism based on load on the Kafka topic.
When you say overrunning the topic, do you mean that you don't want there to
be a large lag between consumers and their current offset, and that if it
grows you want NiFi to slow down?  I don't believe there is anything
inherent to the Kafka producer protocol that would inform us of this.  We
could periodically poll for this information and optionally back off.
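
If we went the polling route, a rough sketch of the idea using kafka-python
(a hypothetical external monitor, not anything NiFi ships today; the topic,
group, and threshold are made up for illustration):

    from kafka import KafkaConsumer, TopicPartition

    TOPIC = "my-topic"            # illustrative
    GROUP = "my-consumer-group"   # illustrative
    MAX_LAG = 100_000             # back off once total lag exceeds this

    consumer = KafkaConsumer(group_id=GROUP, bootstrap_servers="localhost:9092")
    partitions = [TopicPartition(TOPIC, p)
                  for p in consumer.partitions_for_topic(TOPIC)]

    def total_consumer_lag():
        # Lag per partition = latest log offset minus the group's committed offset.
        end_offsets = consumer.end_offsets(partitions)
        lag = 0
        for tp in partitions:
            committed = consumer.committed(tp) or 0
            lag += end_offsets[tp] - committed
        return lag

    # A producer-side loop would check this periodically and back off:
    # if total_consumer_lag() > MAX_LAG, pause or slow publishing.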

Is this what you have in mind?

Thanks

On Thu, Apr 4, 2019 at 10:34 AM l vic  wrote:

> I have to ingest a large (200,000 message) data set into a Kafka topic as
> quickly as possible without overrunning the topic... Right now I just use
> a rate limiter to do it, but is there some better "adaptive" way to do it?
> Thank you...
> -V
>


PutKafka use with large quantity of data?

2019-04-04 Thread l vic
I have to ingest a large (200,000 message) data set into a Kafka topic as
quickly as possible without overrunning the topic... Right now I just use a
rate limiter to do it, but is there some better "adaptive" way to do it?
Thank you...
-V


Re: Reusing same flow for different database connections

2019-04-04 Thread Bryan Bende
Hello,

Take a look at the DBCP lookup service (DBCPConnectionPoolLookup); it allows
you to register one or more connection pool services and then select one at
runtime based on an incoming flow file having an attribute called
database.name.
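
For example, in a setup like the one you described, the incoming JSON could
simply carry the pool name alongside the other parameters (the value
"warehouse_db" is hypothetical; it has to match the name a pool is
registered under in the lookup service):

    {
      "database.name": "warehouse_db",
      "target": "some_table",
      "source": "other_table",
      "query": "SELECT * FROM other_table"
    }

Once those key/value pairs become flow file attributes, the lookup service
resolves database.name to the matching pool at runtime.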

Thanks,

Bryan

On Thu, Apr 4, 2019 at 8:47 AM Max  wrote:

> Hello!
>
> We are working on a project that requires importing data from tables
> across different database servers (as in, different db connection pools.)
>
> The data flow itself is the same across maybe 40-50 tables and around 10
> connections. I tried to create an abstract flow that can be parameterized
> by an incoming flow file in order to avoid duplication. My flow works like
> this:
>
> - A .json file is written to a folder. It contains something like this:
> {"target": "some_table", "source": "other_table", "query": "SELECT * FROM
> other_table", ...}
> - We convert the json key/value pairs to flow file attributes
> - Select the data as records from the source table using the query
> attribute
> - Store the data in bulk in the target table
>
> This works well since we can use the parameters from the .json file in the
> processors of the flow, I don't need to hardcode table names or the query
> in the processors.
>
> Where this approach breaks down: I can't parameterize the database
> connection/the connection pool name. So in the end I would need to
> duplicate the same flow 10x for each database.
>
> Maybe I'm approaching this from the wrong direction, is there a better way
> to do what I want?
>
> Max
>
-- 
Sent from Gmail Mobile


Reusing same flow for different database connections

2019-04-04 Thread Max
Hello!

We are working on a project that requires importing data from tables across
different database servers (as in, different DB connection pools).

The data flow itself is the same across maybe 40-50 tables and around 10
connections. I tried to create an abstract flow that can be parameterized
by an incoming flow file in order to avoid duplication. My flow works like
this:

- A .json file is written to a folder. It contains something like this:
{"target": "some_table", "source": "other_table", "query": "SELECT * FROM
other_table", ...}
- We convert the json key/value pairs to flow file attributes
- Select the data as records from the source table using the query attribute
- Store the data in bulk in the target table
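
One plausible way to wire those attributes into the downstream processors
(the processor and property names here are my assumption of a typical
record-based setup, not something specified above):

    EvaluateJsonPath   -> writes target / source / query as flow file attributes
    ExecuteSQLRecord   -> SQL select query = ${query}
    PutDatabaseRecord  -> Table Name       = ${target}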

This works well since we can use the parameters from the .json file in the
processors of the flow; I don't need to hardcode table names or the query
in the processors.

Where this approach breaks down: I can't parameterize the database
connection (the connection pool name). So in the end I would need to
duplicate the same flow 10x, once for each database.

Maybe I'm approaching this from the wrong direction; is there a better way
to do what I want?

Max


GetHbase state

2019-04-04 Thread Dwane Hall
Hey fellow NiFi fans,

I was recently loading data into Solr via HBase (around 700 GB, ~60,000,000
db rows) using NiFi and noticed some inconsistent behaviour with the GetHBase
processor, and I'm wondering if anyone else has noticed similar behaviour
when using it.

Here's our environment and the high-level workflow we were attempting:

Apache NiFi 1.8
Two-node cluster (external ZooKeeper maintaining processor state)
HBase 1.1.2

Step 1: We execute an external Pig job to load data into an HBase table.
Step 2 (NiFi): We use a GetHBase processor listening to the HBase table for
new data, with execution set to Primary Node only.
Step 3 (NiFi): Some light attribute addition, and eventually the data is
stored in Solr using PutSolrContentStream.

What we found during our testing is that the GetHBase processor did not
appear to accurately maintain its state as data was being loaded out of
HBase. We tried a number of load strategies with varying success.


1. No concurrency - Wait for all data to be loaded into HBase by the Pig job
and then dump all 700 GB of data into NiFi. This was successful, as there
was no state dependency, but we lose the ability to run the Solr load and
the Pig job in parallel, and we dump a lot of data on NiFi at once.
2. Concurrent - The Pig job and the GetHBase processor running concurrently.
Here we found we missed ~30% of the data we were loading into Solr.
3. Staged - Load a portion of data into HBase using Pig, start the GetHBase
processor to load that portion of data, and repeat until all data is loaded
(the Pig job and the GetHBase processor were never run concurrently). In
this scenario the state behaved unusually after the first run: timestamp
entries were placed in the processor state but appeared to be ignored on
subsequent runs, so the same data was reloaded several times.

We were able to work around the issue by using the ScanHBase processor and
specifying our key range to load the data, so it was not a showstopper for
us. I'm just wondering whether any other users in the community have had
similar experiences using this processor, or whether I need to revise my
local environment configuration.
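
For anyone curious, the workaround is equivalent in spirit to a plain
bounded scan; a minimal Python sketch with happybase (the host, table, and
key range are made up, and ScanHBase itself takes the range as processor
properties rather than code):

    import happybase

    connection = happybase.Connection("hbase-host")  # hypothetical host
    table = connection.table("my_table")             # hypothetical table

    # Scan a bounded row-key range instead of relying on timestamp state,
    # mirroring what we configured on the ScanHBase processor.
    for row_key, columns in table.scan(row_start=b"key-000000",
                                       row_stop=b"key-999999"):
        print(row_key)  # stand-in for the downstream NiFi flow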

Thanks,

Dwane


Re: NiFi Registry Not Auditing Denied Errors

2019-04-04 Thread Shawn Weeks
It looks like it will do this if you don’t grant the host access to
/buckets, which is a valid resource.

Sent from my iPhone

> On Apr 4, 2019, at 1:45 AM, Koji Kawamura  wrote:
> 
> Hi Shawn,
> 
> The 'No applicable policies could be found.' message can be logged
> when a request is made against a resource which doesn't exist.
> https://github.com/apache/nifi-registry/blob/master/nifi-registry-core/nifi-registry-framework/src/main/java/org/apache/nifi/registry/security/authorization/resource/Authorizable.java#L236,L247
> 
> If the request is for a valid resource but the user doesn't have the right
> permissions, then the log should look like this:
> 2019-04-04 14:34:58,492 INFO [NiFi Registry Web Server-71]
> o.a.n.r.w.m.AccessDeniedExceptionMapper identity[CN=alice, OU=NIFI],
> groups[] does not have permission to access the requested resource.
> Unable to view Bucket with ID b5c0b8d3-44df-4afd-9e4b-114c0e299268.
> Returning Forbidden response.
> 
> Enabling Jetty debug log may be helpful to get more information, but
> lots of noisy logs should be expected.
> E.g. add this entry to conf/logback.xml
> 
> 
> Thanks,
> Koji


Re: NiFi Registry Not Auditing Denied Errors

2019-04-04 Thread Koji Kawamura
Hi Shawn,

The 'No applicable policies could be found.' message can be logged
when a request is made against a resource which doesn't exist.
https://github.com/apache/nifi-registry/blob/master/nifi-registry-core/nifi-registry-framework/src/main/java/org/apache/nifi/registry/security/authorization/resource/Authorizable.java#L236,L247

If the request is for a valid resource but the user doesn't have the right
permissions, then the log should look like this:
2019-04-04 14:34:58,492 INFO [NiFi Registry Web Server-71]
o.a.n.r.w.m.AccessDeniedExceptionMapper identity[CN=alice, OU=NIFI],
groups[] does not have permission to access the requested resource.
Unable to view Bucket with ID b5c0b8d3-44df-4afd-9e4b-114c0e299268.
Returning Forbidden response.

Enabling the Jetty debug log may be helpful for getting more information,
but lots of noisy logs should be expected.
E.g. add an entry like the following to conf/logback.xml:
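    <!-- presumed reconstruction of the stripped entry: Jetty debug logging -->
    <logger name="org.eclipse.jetty" level="DEBUG" />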


Thanks,
Koji

On Sat, Mar 30, 2019 at 11:58 PM Shawn Weeks  wrote:
>
> I remember seeing something where we reduced the amount of auditing for 
> access denied errors the NiFi Ranger plugin was doing. On a new installation 
> with Registry 0.3.0 I’m not seeing any access denied errors at all despite 
> the app log showing them. It’s making it really hard to figure out what 
> exactly is failing. I know it’s related to the host access but the error log 
> doesn’t say what was being accessed.
>
>
>
> Basically I get log messages like these.
>
>
>
> 2019-03-30 09:56:54,817 INFO [NiFi Registry Web Server-20] 
> o.a.n.r.w.m.AccessDeniedExceptionMapper identity[hdp31-df3.dev.example.com], 
> groups[] does not have permission to access the requested resource. No 
> applicable policies could be found. Returning Forbidden response.
>
>
>
> I could just give blanket access to everything but I prefer to be more 
> precise.
>
>
>
> Thanks
>
> Shawn Weeks