Merge multiple flowfiles

2016-06-01 Thread Huagen peng
Hi,

In the data flow I am dealing with now, there are multiple (up to 200) logs 
associated with a given hour.  I need to process these fragmentary hourly logs 
and then concatenate them into a single file.  The approach I am using now has 
an UpdateAttribute processor that sets an arbitrary segment.original.filename 
attribute on all the flowfiles I want to merge.  Then I use a MergeContent 
processor, with UpdateAttribute and RouteOnAttribute processors forming a 
loop that confirms five times that the merge is complete.  Even with that, I 
occasionally still get more than one merged flowfile.

Is there a better way to do this?  Or should I increase the loop count, say to 10?

Thanks.

Huagen  

Re: Merge multiple flowfiles

2016-06-02 Thread Andy LoPresto
Huagen,

Sorry, I am a little confused. My understanding is that you want to combine n 
individual logs (each with a respective flowfile) from a specific hour into a 
single file. What is confusing is when you say “Even with that [a 5x 
confirmation loop], I occasionally still get more than one merged flowfile.” Do 
you mean that what you expected to be combined into a single flowfile is output 
as two distinct and incomplete flowfiles?

Without seeing a template of your workflow, I can make a couple of suggestions.

First, as mentioned last night by James Wing, I would encourage you to look at 
the MergeContent [1] processor properties to provide a high threshold for 
merging flowfiles. If you know the number of log files per hour a priori, you 
can set that as the “Minimum Number of Entries” and ensure that output will 
wait until that many flowfiles have been accumulated.
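
As a rough illustration of that threshold idea (a plain-Python simulation, not 
NiFi code and not how MergeContent is implemented), the sketch below groups 
flowfiles by a correlation attribute and only merges a group once it holds at 
least a minimum number of entries. The function name merge_when_complete, the 
dictionary layout of a flowfile, and the sample hour values are all 
hypothetical.

```python
from collections import defaultdict


def merge_when_complete(flowfiles, correlation_key, min_entries):
    """Group flowfiles by one attribute; merge only groups that are full.

    Returns (merged, pending): merged maps a group key to the concatenated
    content, pending maps a group key to the parts still waiting.
    """
    bins = defaultdict(list)
    for flowfile in flowfiles:
        group = flowfile["attributes"][correlation_key]
        bins[group].append(flowfile["content"])

    merged, pending = {}, {}
    for group, parts in bins.items():
        if len(parts) >= min_entries:
            # Enough entries have accumulated: emit one merged result.
            merged[group] = "".join(parts)
        else:
            # Below the threshold: keep waiting instead of emitting a partial merge.
            pending[group] = parts
    return merged, pending


# Hypothetical input: two fragments for hour 10, one fragment for hour 11.
flowfiles = [
    {"attributes": {"segment.original.filename": "2016-06-01-10"}, "content": "a\n"},
    {"attributes": {"segment.original.filename": "2016-06-01-10"}, "content": "b\n"},
    {"attributes": {"segment.original.filename": "2016-06-01-11"}, "content": "c\n"},
]
merged, pending = merge_when_complete(flowfiles, "segment.original.filename", 2)
```

With a threshold of 2, only the hour that has both fragments is merged; the 
other hour stays pending rather than producing a second, incomplete output.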

Also, given that you have described a “loop”, I would imagine you may have 
multiple connections feeding into MergeContent. MergeContent can have 
unexpected behavior with multiple incoming connections, and so I would 
recommend adding a Funnel to aggregate all incoming connections and provide a 
single incoming connection to MergeContent.

Please let us know if this helps, and if not, please share a template and some 
sample input if possible. Thanks.

[1] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/index.html


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69


Re: Merge multiple flowfiles

2016-06-02 Thread Huagen peng
Thanks for the reply, Andy.

I ended up abandoning my previous approach and using ExecuteStreamCommand to 
output (with the zcat command on GZ files) all the files I want to concatenate.  
I then perform some data manipulation and save the file.
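
For readers who want the same effect outside of a shell command, here is a 
minimal Python sketch of what zcat over several GZ files does: decompress each 
file in order and append the result to one output file. It is only an 
illustration, not the command used in the flow above; the function name 
concat_gzip_logs and any paths passed to it are hypothetical.

```python
import gzip
import shutil


def concat_gzip_logs(paths, out_path):
    """Decompress each gzip file in the given order and append it to out_path."""
    with open(out_path, "wb") as out:
        for path in paths:
            with gzip.open(path, "rb") as src:
                # Stream in chunks so large logs are never held in memory at once.
                shutil.copyfileobj(src, out)
```

The order of the paths determines the order of the content in the output, so 
the caller would typically sort the fragment names before passing them in.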

Huagen


Re: Merge multiple flowfiles

2016-06-03 Thread Oleg Zhurakousky
Huge

Just to close the loop on this one, I also wanted to point out this JIRA, 
https://issues.apache.org/jira/browse/NIFI-1926, for a general-purpose 
aggregation processor that would indeed support multiple connections and 
configurable aggregation, release, and correlation strategies.
It would be nice if you could describe your use case in that JIRA, so we can 
start gathering these use cases.

Cheers
Oleg


Re: Merge multiple flowfiles

2016-06-03 Thread Oleg Zhurakousky
Huagen,
I also want to apologize for my spell-checker butchering your name ;)

Cheers
Oleg
