Re: Requesting Obscene FlowFile Batch Sizes

2016-09-20 Thread Bryan Bende
Peter,

Does 10k happen to be your swap threshold in nifi.properties by any chance
(it defaults to 20k I believe)?

I suspect the behavior you are seeing could be due to the way swapping
works, but Mark or others could probably confirm.

I found this thread where Mark explained how swapping works with a
background thread, and I believe it still works this way:
http://apache-nifi.1125220.n5.nabble.com/Nifi-amp-Spark-receiver-performance-configuration-td524.html

-Bryan

On Tue, Sep 20, 2016 at 10:22 AM, Peter Wicks (pwicks) wrote:

> I’m using ConvertJSONToSQL, followed by PutSQL.  I’m using Teradata, which
> supports a special JDBC mode called FastLoad, designed for a minimum of
> 100,000 rows of data per batch.
>
>
>
> What I’m finding is that when PutSQL requests a new batch of FlowFiles
> from the queue, which has over 1 million rows in it, with a batch size of
> 100k, it always returns a maximum of 10k.  How can I get my obscenely
> sized batch request to return all the FlowFiles I’m asking for?
>
>
>
> Thanks,
>
>   Peter
>
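For reference, the swap threshold Bryan mentions is a setting in conf/nifi.properties and, as I understand it, applies per connection queue. The excerpt below is only a sketch; the values shown are assumed stock defaults for NiFi of that era, so verify them against your own install. If the 10k cap Peter observes really is the swap threshold, his instance would have it set to 10000 rather than the 20000 default.

    # conf/nifi.properties -- swap-related settings (values shown are assumed
    # stock defaults; verify against your own installation)
    nifi.queue.swap.threshold=20000
    nifi.swap.in.period=5 sec
    nifi.swap.in.threads=1
    nifi.swap.out.period=5 sec
    nifi.swap.out.threads=4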


Re: Requesting Obscene FlowFile Batch Sizes

2016-09-20 Thread Andy LoPresto
Bryan,

That’s a good point. Would running with a larger Java heap and higher swap 
threshold allow Peter to get larger batches out?

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69






Re: Requesting Obscene FlowFile Batch Sizes

2016-09-20 Thread Bryan Bende
Andy,

That was my thinking. An easy test might be to bump the threshold up to
100k (increase heap if needed) and see if it starts grabbing 100k every
time.

If it does, then I would think it is swapping related, and you would then need
to figure out whether you really want to get all 1 million in a single batch,
and whether there's enough heap to support that.

-Bryan
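If you want to run Bryan's test, the two knobs involved are the JVM heap in conf/bootstrap.conf and the swap threshold in conf/nifi.properties. The sketch below assumes the stock property names; the sizes are purely illustrative, not recommendations.

    # conf/bootstrap.conf -- raise the heap before raising the swap threshold
    # (example sizes only; tune for your hardware)
    java.arg.2=-Xms4g
    java.arg.3=-Xmx4g

    # conf/nifi.properties -- allow up to 100k FlowFiles to stay in the active queue
    nifi.queue.swap.threshold=100000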



RE: Requesting Obscene FlowFile Batch Sizes

2016-09-20 Thread Peter Wicks (pwicks)
Andy/Bryan,

Thanks for all of the detail, it’s been helpful.
I actually did an experiment this morning where I modified the processor to
force it to keep calling `get` until it had all 1 million FlowFiles.  Since I
was calling it sequentially, NiFi was able to move FlowFiles out of swap and
into the active queue on each request. I was able to retrieve and process them
all, which was great until… NiFi tried to move them through provenance.  At
that point NiFi ran out of memory and fell over (stopped responding).  Right
before it ran out of memory I received several bulletins saying Provenance was
being written to too quickly and was being throttled.
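Peter doesn't share the modified processor, so the following is only a guess at the shape of the experiment he describes: a hypothetical processor class, with the target count and per-call chunk size made up and the FastLoad/transfer logic elided.

    // Hypothetical sketch -- not Peter's actual code. Assumes the NiFi 1.x processor API.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.exception.ProcessException;

    public class DrainQueueSketch extends AbstractProcessor {

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session)
                throws ProcessException {
            final int target = 1_000_000;   // illustrative target batch size
            final List<FlowFile> batch = new ArrayList<>();

            // Keep asking for more; each call gives the framework a chance to
            // swap additional FlowFiles back into the active queue.
            while (batch.size() < target) {
                final List<FlowFile> pulled = session.get(10_000);
                if (pulled.isEmpty()) {
                    break;   // nothing left (or nothing swapped back in yet)
                }
                batch.addAll(pulled);
            }

            // ... build the FastLoad batch from `batch`, then transfer/commit ...
            // Everything pulled stays in this one session (FlowFile records and
            // provenance events in memory) until commit -- the pressure Peter hit.
        }
    }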

I found another solution to my mass insert and got it up and running. Using a 
Teradata JDBC proprietary flag called FastLoadCSV, and a new custom processor, 
I was able to pass in a CSV file to my JDBC driver and get the same result.  In 
this scenario there was just a single FlowFile and everything went smoothly.
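He doesn't show the connection details either; as far as I know, FastLoadCSV is selected through the Teradata JDBC driver's TYPE connection parameter, so something along these lines is a plausible shape for the URL. The host, database, credentials, and the exact CSV hand-off are all assumptions; check the driver's documentation.

    // Assumed shape of a FastLoadCSV connection -- everything here is illustrative.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class FastLoadCsvUrlSketch {
        public static Connection open() throws SQLException {
            final String url = "jdbc:teradata://tdhost/DATABASE=mydb,TYPE=FASTLOADCSV";
            return DriverManager.getConnection(url, "user", "password");
            // The custom processor would then stream the FlowFile's CSV content to
            // the driver via a PreparedStatement on this connection (driver-specific).
        }
    }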

Thanks again!

Peter Wicks




Re: Requesting Obscene FlowFile Batch Sizes

2016-09-20 Thread Andy LoPresto
Hi Peter,

Thanks for letting us know you found a solution and for the additional context. 
Provenance performance is a key area of focus in the next couple releases, so 
hopefully we will have that fixed soon. 

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69



Re: Requesting Obscene FlowFile Batch Sizes

2016-09-21 Thread Joe Witt
It would buy time, but either way it becomes a magic value people have
to know about.  This is not unlike the SplitText scenario where we
recommend doing two-phase splits.  The problem is that for the
ProcessSession we hold information about the FlowFiles (not their
content) in memory, along with the provenance events.  When we're
talking hundreds of thousands or more events in a session, that adds up
really quickly.  Users should not need to know or worry about this sort
of thing.  We need a way to pre-stage these things to the respective
repositories (provenance/FlowFile) so this can go back to where it
belongs as a framework concern.  Easier said than done, but a good goal
for us to have.

Peter's use case is a good one to rally around, as the way he wanted
it to work is reasonable and intuitive, and we should try to make that
happen.

Thanks
Joe
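
Until that kind of framework-level pre-staging exists, the usual way to stay out of trouble is to keep each session small. The sketch below is a hypothetical processor that handles a bounded chunk per onTrigger call rather than one giant session; it deliberately gives up the single-batch semantics Peter wanted, and the chunk size is illustrative.

    // Hypothetical sketch: bound memory by handling a modest chunk per onTrigger
    // call instead of draining the whole queue into one session.
    import java.util.Collections;
    import java.util.List;
    import java.util.Set;

    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.Relationship;
    import org.apache.nifi.processor.exception.ProcessException;

    public class ChunkedBatchSketch extends AbstractProcessor {

        static final Relationship REL_SUCCESS = new Relationship.Builder()
                .name("success")
                .build();

        @Override
        public Set<Relationship> getRelationships() {
            return Collections.singleton(REL_SUCCESS);
        }

        @Override
        public void onTrigger(final ProcessContext context, final ProcessSession session)
                throws ProcessException {
            // A few thousand FlowFile records plus their provenance events per
            // session keeps the in-memory footprint Joe describes small.
            final List<FlowFile> chunk = session.get(5_000);   // illustrative size
            if (chunk.isEmpty()) {
                return;
            }
            // ... process the chunk ...
            session.transfer(chunk, REL_SUCCESS);
            // AbstractProcessor commits the session after onTrigger returns,
            // releasing the FlowFile records and provenance events from memory.
        }
    }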
