Hello, NiFi lets you merge data based on some correlation value. In your case you seem to want that correlation to be the combined key of 'type + YYYY + mm'. This is fine and easily configured.
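As a rough sketch (the attribute names below are assumptions, not from the thread; `Correlation Attribute Name` is the MergeContent property that drives the binning, and the key can be built with NiFi Expression Language in UpdateAttribute):

```
# UpdateAttribute -- build the bin key (attribute name 'merge.key' and
# source attributes 'type'/'event.time' are illustrative assumptions)
merge.key = ${type}/${event.time:format('yyyy/MM')}

# MergeContent -- bin flowfiles by that key
Merge Strategy             : Bin-Packing Algorithm
Correlation Attribute Name : merge.key
```

Flowfiles sharing the same merge.key value then land in the same bin, which is the 'type + YYYY + mm' correlation described above.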
NiFi lets you merge data based on reaching the min/max size requirements or based on a time-based kick-out, and then lets you write to HDFS. So there you have equivalent capability plus all the other benefits of NiFi.

The next thing you appear to be adding to this, which you'd not get out of Flume, is NiFi clustering. By adding clustering you're also adding a partitioning step, because each node will get data from different topics/partitions in Kafka. If you don't want this, then make the step of consuming from Kafka a primary-node-only task. This way another NiFi node will only kick in when necessary (like in a node-failure case).

A single node will perform extremely well if you're talking about data rates low enough that you don't want to wait for bins to reach max size, and if you have so much data flowing that you want more nodes running in parallel, then the max size will be reached quickly anyway even if split. So I think you have a pretty straightforward path to great results here.

Thanks

On Mon, Sep 17, 2018 at 6:20 PM Rob Verkuylen <r...@verkuylen.net> wrote:
>
> Hi Brian,
>
> My ultimate goal is to create the smallest number of files of the largest
> size on HDFS. In this case I have 32 types of data coming in from a single
> topic, marked by an attribute, and my binning is by <type/YYYY/mm>. If for
> example I get a stream containing data from these 32 types spread out over
> the last 2 months, I seem to be getting 32 (types) x 2 (months) x 3 (nodes)
> x 32 MB (size limit) = 6 GB in my queue before a merge gets triggered. This
> effect will be compounded when my next step is binning on 64 MB (12 GB queue)
> and 128 MB (24 GB queue needed) thereafter.
>
> The merge will only get triggered on data from the same node the
> MergeContent is running on. So while there is effectively more than enough
> data in the queue to merge, it will hold out until there is enough on a
> single node.
> This is ok for a data type with enough data, but some are bigger
> than others, and sometimes there is enough data in the total queue, but not
> enough on a single node, forcing the data to only flow through by age-out,
> resulting in small files on HDFS down the line.
>
> Flume opens a file in HDFS and starts appending until max size or time is
> reached. I'm looking for similar or better functionality in NiFi, resulting
> in few and large files in HDFS.
>
> Rob
>
> On Mon, Sep 17, 2018 at 10:56 PM Bryan Bende <bbe...@gmail.com> wrote:
>>
>> Hello,
>>
>> I'm not sure I follow... wouldn't it be more efficient to merge
>> multiple files in parallel across the cluster?
>>
>> If you had to converge them all to one node, then this doesn't seem
>> much different than just having a stand-alone NiFi, which would go
>> against needing a cluster to achieve the desired throughput.
>>
>> -Bryan
>>
>> On Mon, Sep 17, 2018 at 4:02 PM Rob Verkuylen <r...@verkuylen.net> wrote:
>> >
>> > I really want to replace Flume with NiFi, so for the simplest use case I
>> > basically have Kafka -> UpdateAttribute ->
>> > MergeContent (32 -> 64 -> 128 MB) -> PutHDFS.
>> >
>> > I need to run in cluster mode to get the throughput I need, but run into
>> > the problem that flowfiles assigned to nodes are only merged on those
>> > nodes, effectively dividing my merge efficiency by the number of NiFi
>> > nodes.
>> >
>> > Is there a workaround for this issue?
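Rob's worst-case queue estimate earlier in the thread can be checked with quick arithmetic (the numbers are taken from his mail; this just reproduces his figure, it is not a NiFi API):

```python
# Worst case: one partially filled bin per (type, month) combination on
# every node before any bin reaches its minimum size and merges.
types = 32       # distinct data types on the Kafka topic
months = 2       # YYYY/mm bins the data is spread over
nodes = 3        # NiFi cluster size
min_bin_mb = 32  # MergeContent minimum bin size, in MB

queued_mb = types * months * nodes * min_bin_mb
print(queued_mb)  # 6144 MB, i.e. roughly the 6 GB Rob describes
```

Doubling the bin size to 64 MB or 128 MB doubles or quadruples this figure, which is the compounding effect Rob mentions; consuming on the primary node only removes the `nodes` factor entirely.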