Hi Joe,

I am now exploring your solution, starting with the flow below:
ListFile > FetchFile > CompressContent > PutFile

All seems fine, except for some confusion about how ListFile identifies
new files. To test it, I renamed an already-processed file, put it in
the input folder, and found that the file was not processed. Then I
changed the content of the file and it was picked up immediately. My
question is: what criteria does ListFile use to decide a file is new?
Can I change it to use only the file name?

Thanks in advance.
-Obaid

On Fri, Jan 1, 2016 at 10:43 PM, Joe Witt <joe.w...@gmail.com> wrote:
> Hello Obaid,
>
> At 6 TB/day and an average size of 2-3 GB per dataset you're looking
> at a sustained rate of 70+ MB/s and a pretty low transaction rate, so
> well within a good range to work with on a single system.
>
> "Is there any way to bypass writing flow files on disk or directly
> pass those files to HDFS as it is?"
>
> There is no way to bypass NiFi taking a copy of that data, by design.
> NiFi is helping you formulate a graph of dataflow requirements from
> given source(s) through given processing steps, ultimately driving
> data into given destination systems. As a result it takes on the
> challenge of handling the transactionality of each interaction and
> the buffering and backpressure needed to deal with the realities of
> different production/consumption patterns.
>
> "If the files on the spool directory are compressed (zip/gzip), can
> we store files on HDFS as uncompressed?"
>
> Certainly. Both of those formats (zip/gzip) are supported in NiFi out
> of the box. You simply run the data through the proper processor
> prior to PutHDFS to unpack (zip) or decompress (gzip) as needed
> (e.g., UnpackContent for zip, or CompressContent in decompress mode
> for gzip).
>
> "2.a Can we use our existing java code for masking? If yes, then how?
> 2.b For this scenario we also want to bypass storing flow files on
> disk. Can we do it on the fly, masking and storing on HDFS?
> 2.c If the source files are compressed (zip/gzip), is there any issue
> for masking here?"
>
> You would build a custom NiFi processor that leverages your existing
> code. If your code is able to operate on an InputStream and write to
> an OutputStream then it is very likely you'll be able to handle
> arbitrarily large objects with zero negative impact on the JVM heap.
> This is thanks to the fact that the data is present in NiFi's
> repository with copy-on-write/pass-by-reference semantics and that
> the API exposes those streams to your code in a transactional manner.
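>
> To make that concrete, here is a rough, untested sketch of such a
> processor. The class name MaskContent and the maskLine method are
> placeholders standing in for your own masking code; the rest uses the
> standard processor API:
>
>     import java.io.BufferedReader;
>     import java.io.BufferedWriter;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.io.InputStreamReader;
>     import java.io.OutputStream;
>     import java.io.OutputStreamWriter;
>     import java.nio.charset.StandardCharsets;
>     import java.util.Collections;
>     import java.util.Set;
>
>     import org.apache.nifi.flowfile.FlowFile;
>     import org.apache.nifi.processor.AbstractProcessor;
>     import org.apache.nifi.processor.ProcessContext;
>     import org.apache.nifi.processor.ProcessSession;
>     import org.apache.nifi.processor.Relationship;
>     import org.apache.nifi.processor.io.StreamCallback;
>
>     public class MaskContent extends AbstractProcessor {
>
>         static final Relationship REL_SUCCESS = new Relationship.Builder()
>                 .name("success")
>                 .description("FlowFiles whose content has been masked")
>                 .build();
>
>         @Override
>         public Set<Relationship> getRelationships() {
>             return Collections.singleton(REL_SUCCESS);
>         }
>
>         @Override
>         public void onTrigger(ProcessContext context, ProcessSession session) {
>             FlowFile flowFile = session.get();
>             if (flowFile == null) {
>                 return;
>             }
>             // session.write streams the content through the callback,
>             // so the whole file is never held in the JVM heap.
>             flowFile = session.write(flowFile, new StreamCallback() {
>                 @Override
>                 public void process(InputStream in, OutputStream out) throws IOException {
>                     BufferedReader reader = new BufferedReader(
>                             new InputStreamReader(in, StandardCharsets.UTF_8));
>                     BufferedWriter writer = new BufferedWriter(
>                             new OutputStreamWriter(out, StandardCharsets.UTF_8));
>                     String line;
>                     while ((line = reader.readLine()) != null) {
>                         // delegate to your existing masking code here
>                         writer.write(maskLine(line));
>                         writer.newLine();
>                     }
>                     writer.flush();
>                 }
>             });
>             session.transfer(flowFile, REL_SUCCESS);
>         }
>
>         // placeholder for your existing masking logic
>         private String maskLine(String line) {
>             return line.replaceAll("\\d", "*");
>         }
>     }
>
> Because the callback only ever sees streams, the same approach keeps
> working after a decompress step for your gzip'd sources.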
>
> If you want the process of writing to HDFS to also do decompression
> and masking in one pass you'll need to extend/alter the PutHDFS
> processor to do that. It is probably best to implement the flow using
> cohesive processors (grab files, decompress files, mask files, write
> to HDFS). Given how the repository construct in NiFi works, and given
> how caching in Linux works, it is very possible you'll be quite
> surprised by the throughput you see. Even then, you can optimize once
> you're sure you need to. The other thing to keep in mind here is that
> often a flow that starts out as specific as this turns into a great
> place to tap the stream of data to feed some new system or new
> algorithm with a different format or protocol. At that moment the
> benefits become even more obvious.
>
> Regarding the Flume processors in NiFi and their memory usage: NiFi
> offers a nice hosting mechanism for Flume processes and brings some
> of the benefits of NiFi's UI, provenance, and repository concept.
> However, we're still largely limited to the design assumptions one
> gets when building a Flume process, and those can be quite memory
> limiting. We see what we have today as a great way to help people
> transition their existing Flume flows into NiFi by leveraging their
> existing code, but we would recommend working to phase those out over
> time so that you can take full benefit of what NiFi brings over
> Flume.
>
> Thanks
> Joe
>
>
> On Fri, Jan 1, 2016 at 4:18 AM, obaidul karim <obaidc...@gmail.com> wrote:
> > Hi,
> >
> > I am new to NiFi and am exploring it as an open source ETL tool.
> >
> > As I understand it, flow files are stored on local disk and contain
> > the actual data.
> > If that is true, consider the scenarios below:
> >
> > Scenario 1:
> > - In a spool directory we have terabytes (5-6 TB/day) of files
> > coming from external sources
> > - I want to push those files to HDFS as-is, without any changes
> >
> > Scenario 2:
> > - In a spool directory we have terabytes (5-6 TB/day) of files
> > coming from external sources
> > - I want to mask some of the sensitive columns
> > - Then send one copy to HDFS and another copy to Kafka
> >
> > Questions for Scenario 1:
> > 1.a In that case those 5-6 TB of data will be written to local disk
> > again as flow files, causing double I/O, which may eventually hurt
> > performance due to an I/O bottleneck.
> > Is there any way to bypass writing flow files to disk, or to pass
> > those files directly to HDFS as they are?
> > 1.b If the files in the spool directory are compressed (zip/gzip),
> > can we store the files on HDFS uncompressed?
> >
> > Questions for Scenario 2:
> > 2.a Can we use our existing Java code for masking? If yes, then how?
> > 2.b For this scenario we also want to bypass storing flow files on
> > disk. Can we do it on the fly, masking and storing on HDFS?
> > 2.c If the source files are compressed (zip/gzip), is there any
> > issue with masking here?
> >
> >
> > In fact, I tried the above using Flume plus Flume interceptors.
> > Everything worked fine with smaller files, but when source files
> > grow larger than 50 MB, Flume chokes :(
> > So I am exploring options in NiFi. I hope to get some guidance from
> > you guys.
> >
> >
> > Thanks in advance.
> > -Obaid