Giovanni,

In the scenario that you laid out here, the merged FlowFile will not have a 'dt'
attribute, because the incoming FlowFiles have conflicting values for it. With the
"Keep Only Common Attributes" strategy, any attribute whose values conflict across
the binned FlowFiles is not carried through to the merged FlowFile.

If it is important to you that this attribute be carried through, you can set the
"Correlation Attribute Name" property to 'dt'. This causes the processor to bin
together only FlowFiles that have the same value for the 'dt' attribute. Since there
are then no conflicting values within a bin, the merged FlowFile will keep the
attribute.
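
For the flow you described, the relevant MergeContent settings would look roughly
like this (values taken from your example; anything not listed keeps its default):

    Merge Strategy              = Bin-Packing Algorithm
    Attribute Strategy          = Keep Only Common Attributes
    Correlation Attribute Name  = dt
    Minimum Number of Entries   = 4

One thing to keep in mind: with a correlation attribute set, "Minimum Number of
Entries" applies per bin, so each distinct 'dt' value has to accumulate that many
FlowFiles before its bin is merged (or until its Max Bin Age, if you set one, is
reached).

If the 'dt' attribute is not already present on the FlowFiles, you can create it
with an UpdateAttribute processor ahead of MergeContent, as Joe suggested. For
example, a property named 'dt' whose value is something along the lines of
${path:substringAfterLast('/')} could work; treat that expression as a sketch and
adjust it to however your source processor actually populates the path attributes.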

Thanks
-Mark




> On Nov 29, 2016, at 9:34 AM, Giovanni Lanzani 
> <giovannilanz...@godatadriven.com> wrote:
> 
> Hi Joe,
> 
> I still have trouble following you. 
> 
> Let's assume I have the MergeContent processor with the "Keep Only Common 
> Attributes" strategy. The flow files are coming in like so:
> 
> ff_1 (attribute dt = 20161120)
> ff_2 (attribute dt = 20161120)
> ff_3 (attribute dt = 20161121)
> ff_4 (attribute dt = 20161120)
> 
> If my Minimum Number of Entries in MergeContent is set to 4, what dt 
> attribute will the flow file coming out of the MergeContent processor have? 
> 20161120 or 20161121? 
> 
> Or is NiFi capable of waiting to have enough flow files with each unique 
> value of dt before merging? If so, I think the docs could use some help :)
> 
> From what I could see, that dt attribute was gone after the merge, but maybe 
> I'm doing it wrong.
> 
> Cheers,
> 
> Giovanni
> 
> 
> 
>> -----Original Message-----
>> From: Joe Witt [mailto:joe.w...@gmail.com]
>> Sent: Tuesday, November 29, 2016 3:25 PM
>> To: users@nifi.apache.org
>> Subject: Re: Keep attributes when merging
>> 
>> Giovanni
>> 
>> You can definitely do this.  The file pulling should be retaining the key path
>> information as flow file attributes.
>> 
>> The merge process has a property to control what happens with attributes.
>> The default is to copy over only the attributes that match across the merged
>> FlowFiles, which is likely what you'll want.  Take a look at "Attribute
>> Strategy".  Now, you do want to retain some key values, namely the part of the
>> timestamp you want to group on.  You could do this with an UpdateAttribute
>> processor before MergeContent.  Use it to create an attribute such as
>> "base-timestamp" that pulls out just the common part of the timestamp.  In
>> MergeContent you can then correlate on this value, and since it will be the
>> same across the bin, it will also be there for you afterwards.  You can then
>> use it when writing to HDFS.
>> 
>> This is a pretty common use case so we can definitely help you get where you
>> want to go with this.
>> 
>> Thanks
>> Joe
>> 
>> On Tue, Nov 29, 2016 at 9:14 AM, Giovanni Lanzani
>> <giovannilanz...@godatadriven.com> wrote:
>>> Hi all,
>>> 
>>> I have the following use case:
>>> 
>>> I'm reading XML files from a folder whose subfolders follow this structure:
>>> 
>>> /my_folder/20161120/many XML files inside
>>> /my_folder/20161121/many XML files inside
>>> /my_folder/201611.../many XML files inside
>>> 
>>> The current pipeline involves: XML -> JSON -> Avro -> HDFS
>>> 
>>> where the HDFS folder structure is
>>> 
>>> /my_folder/column=20161120/many Avro files inside
>>> /my_folder/column=20161121/many Avro files inside
>>> /my_folder/column=201611.../many Avro files inside
>>> 
>>> (each column= subfolder is a Hive partition)
>>> 
>>> In order to reduce the number of Avro files in HDFS, I'd love to merge 'em all.
>>> 
>>> However, since NiFi just reads files from the source folders without any
>>> assumption about which folder they come from, the date gets lost when using
>>> MergeContent even if I extract it from the folder name (or the file name).
>>> Using the Defragment strategy does not seem like an option, as I don't know
>>> in advance how many files I'll see.
>>> 
>>> That said: isn't there any way to accomplish what I want to do?
>>> 
>>> The current strategy is to simply merge the files "manually" using avro-tools
>>> and bash scripting.
>>> 
>>> An alternative (although it bends what we actually want to do) is to partition
>>> by import date. Then I'd only need to take care of the midnight edge case, for
>>> example by scheduling NiFi to fetch from the source every 10 minutes but
>>> running MergeContent every 5.
>>> 
>>> If something isn't clear, please let me know.
>>> 
>>> Thanks,
>>> 
>>> Giovanni
