Hello

NiFi lets you merge data based on some correlation value.  In your
case you seem to want that correlation to be the combined key of 'type
+ YYYY + mm'.  This is fine and easily configured.
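
For example, you could build that key in UpdateAttribute and point
MergeContent at it.  This is just a sketch; the attribute names
('type', 'merge.key') and the use of now() are placeholders for
whatever actually carries the type and date in your flow:

  UpdateAttribute (dynamic property):
    merge.key = ${type}-${now():format('yyyy-MM')}

  MergeContent:
    Merge Strategy             = Bin-Packing Algorithm
    Correlation Attribute Name = merge.key

If the year/month should come from the event itself rather than the
wall clock, derive merge.key from that timestamp attribute instead.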

NiFi lets you merge data based on reaching min/max size requirements
or based on a time-based kick-out.  It then lets you write the merged
bundles to HDFS.  So there you have equivalent capability plus all the
other benefits of NiFi.
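
Roughly, the relevant MergeContent and PutHDFS properties look like
the following.  The values are illustrative only, not a
recommendation, and the directory layout is an assumption:

  MergeContent:
    Minimum Group Size     = 128 MB
    Maximum Group Size     = 256 MB
    Max Bin Age            = 10 min  (time-based kick-out so slow bins still flush)
    Maximum number of Bins = 100     (enough to keep every type+month bin open at once)

  PutHDFS:
    Directory = /data/${type}/${now():format('yyyy/MM')}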

The next thing you appear to be adding to this, which you'd not get
out of Flume, is NiFi clustering.  By adding clustering you're also
adding a partitioning step, because each node will get data from
different topic partitions in Kafka.  If you don't want this, then
make the step of consuming from Kafka a primary-node-only task.  This
way another NiFi node will only kick in when necessary (like in a node
failure case).  A single node will perform extremely well if you're
talking about data rates low enough that you don't want to wait for
bins to reach max size; and if you have so much data flowing that you
want more nodes running in parallel, then the max size will be reached
quickly anyway even with the data split across nodes.
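
Concretely, that is just the Scheduling tab on your Kafka consumer
(the exact processor name depends on the Kafka/NiFi version you run):

  ConsumeKafka -> Configure -> Scheduling:
    Execution = Primary node

With that, only the elected primary node pulls from the topic, so all
flowfiles land on one node and a single MergeContent sees the whole
stream.  If that node fails, a new primary is elected and consumption
resumes there.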

So I think you have a pretty straightforward path to great results here.

Thanks
On Mon, Sep 17, 2018 at 6:20 PM Rob Verkuylen <r...@verkuylen.net> wrote:
>
> Hi Brian,
>
> My ultimate goal is to create the fewest, largest possible files on
> HDFS. In this case I have 32 types of data coming in from a single
> topic, marked by an attribute, and my binning is by <type/YYYY/mm>. If,
> for example, I get a stream containing data from these 32 types spread
> out over the last 2 months, I seem to need 32 (types) x 2 (months) x 3
> (nodes) x 32 MB (size limit) = 6 GB in my queue before a merge gets
> triggered. This effect is compounded when my next steps bin on 64 MB
> (12 GB queue) and 128 MB (24 GB queue needed) thereafter.
>
> The merge only gets triggered on data residing on the same node the
> MergeContent is running on. So while there is effectively more than
> enough data in the queue to merge, it will hold out until there is
> enough on a single node. This is OK for a data type with enough data,
> but some are bigger than others, and sometimes there is enough data in
> the total queue but not enough on a single node, forcing the data to
> flow through only by age-out. That results in small files on HDFS down
> the line.
>
> Flume opens a file in HDFS and keeps appending until the max size or
> time is reached. I'm looking for similar or better functionality in
> NiFi, resulting in few, large files in HDFS.
>
> Rob
>
> On Mon, Sep 17, 2018 at 10:56 PM Bryan Bende <bbe...@gmail.com> wrote:
>>
>> Hello,
>>
>> I'm not sure I follow... wouldn't it be more efficient to merge
>> multiple files in parallel across the cluster?
>>
>> If you had to converge them all to one node, then this doesn't seem
>> much different than just having a stand-alone NiFi, which would go
>> against needing a cluster to achieve the desired throughput.
>>
>> -Bryan
>>
>>
>> On Mon, Sep 17, 2018 at 4:02 PM Rob Verkuylen <r...@verkuylen.net> wrote:
>> >
>> > I really want to replace Flume with NiFi, so for the simplest use case I 
>> > basically have Kafka->UpdateAttribute-> 
>> > MergeContent(32->64->128MB)->PutHDFS.
>> >
>> > I need to run in cluster mode to get the throughput I need, but I run into 
>> > the problem that flowfiles assigned to a node are only merged on that node, 
>> > effectively dividing my merge efficiency by the number of NiFi nodes.
>> >
>> > Is there a workaround for this issue?
