Hi Joe,

I am now exploring your solution, starting with the flow below:
ListFile > FetchFile > CompressContent > PutFile

All seems fine, except for some confusion about how ListFile identifies
new files. To test it, I renamed an already-processed file, put it in
the input folder, and found that the file was not processed. Then I
changed the content of the file and it was picked up immediately. My
question is: what criteria does ListFile use to decide a file is new?
Can I change it to use only the file name?

Thanks in advance.
-Obaid

On Fri, Jan 1, 2016 at 10:43 PM, Joe Witt <joe.w...@gmail.com> wrote:
> Hello Obaid,
>
> At 6 TB/day and an average size of 2-3 GB per dataset you're looking
> at a sustained rate of 70+ MB/s and a pretty low transaction rate, so
> well within a good range to work with on a single system.
>
> "Is there any way to bypass writing flow files on disk or directly
> pass those files to HDFS as it is?"
>
> There is no way to bypass NiFi taking a copy of that data, by design.
> NiFi is helping you formulate a graph of dataflow requirements from
> given source(s) through given processing steps, ultimately driving
> data into given destination systems. As a result it takes on the
> challenge of handling the transactionality of each interaction and
> the buffering and backpressure needed to deal with the realities of
> different production/consumption patterns.
>
> "If the files on the spool directory are compressed (zip/gzip), can
> we store files on HDFS as uncompressed?"
>
> Certainly. Both of those formats (zip/gzip) are supported in NiFi out
> of the box. You simply run the data through the proper processor
> prior to PutHDFS to unpack (zip) or decompress (gzip) as needed
> (e.g., UnpackContent for zip, or CompressContent in decompress mode
> for gzip).
>
> "2.a Can we use our existing java code for masking? If yes, then how?
> 2.b For this scenario we also want to bypass storing flow files on
> disk. Can we do it on the fly, masking and storing on HDFS?
> 2.c If the source files are compressed (zip/gzip), is there any issue
> for masking here?"
>
> You would build a custom NiFi processor that leverages your existing
> code. If your code is able to operate on an InputStream and write to
> an OutputStream then it is very likely you'll be able to handle
> arbitrarily large objects with zero negative impact on the JVM heap.
> This is thanks to the fact that the data is present in NiFi's
> repository with copy-on-write/pass-by-reference semantics and that
> the API exposes those streams to your code in a transactional manner.
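>
> To make that concrete, here is a rough, untested sketch of such a
> processor. The class name MaskContent and the maskLine method are
> placeholders standing in for your own masking code; the rest uses the
> standard processor API:
>
>     import java.io.BufferedReader;
>     import java.io.BufferedWriter;
>     import java.io.IOException;
>     import java.io.InputStream;
>     import java.io.InputStreamReader;
>     import java.io.OutputStream;
>     import java.io.OutputStreamWriter;
>     import java.nio.charset.StandardCharsets;
>     import java.util.Collections;
>     import java.util.Set;
>
>     import org.apache.nifi.flowfile.FlowFile;
>     import org.apache.nifi.processor.AbstractProcessor;
>     import org.apache.nifi.processor.ProcessContext;
>     import org.apache.nifi.processor.ProcessSession;
>     import org.apache.nifi.processor.Relationship;
>     import org.apache.nifi.processor.io.StreamCallback;
>
>     public class MaskContent extends AbstractProcessor {
>
>         static final Relationship REL_SUCCESS = new Relationship.Builder()
>                 .name("success")
>                 .description("FlowFiles whose content has been masked")
>                 .build();
>
>         @Override
>         public Set<Relationship> getRelationships() {
>             return Collections.singleton(REL_SUCCESS);
>         }
>
>         @Override
>         public void onTrigger(ProcessContext context, ProcessSession session) {
>             FlowFile flowFile = session.get();
>             if (flowFile == null) {
>                 return;
>             }
>             // session.write streams the content through the callback,
>             // so the whole file is never held in the JVM heap.
>             flowFile = session.write(flowFile, new StreamCallback() {
>                 @Override
>                 public void process(InputStream in, OutputStream out) throws IOException {
>                     BufferedReader reader = new BufferedReader(
>                             new InputStreamReader(in, StandardCharsets.UTF_8));
>                     BufferedWriter writer = new BufferedWriter(
>                             new OutputStreamWriter(out, StandardCharsets.UTF_8));
>                     String line;
>                     while ((line = reader.readLine()) != null) {
>                         // delegate to your existing masking code here
>                         writer.write(maskLine(line));
>                         writer.newLine();
>                     }
>                     writer.flush();
>                 }
>             });
>             session.transfer(flowFile, REL_SUCCESS);
>         }
>
>         // placeholder for your existing masking logic
>         private String maskLine(String line) {
>             return line.replaceAll("\\d", "*");
>         }
>     }
>
> Because the callback only ever sees streams, the same approach keeps
> working after a decompress step for your gzip'd sources.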
>
> If you want the process of writing to HDFS to also do decompression
> and masking in one pass you'll need to extend/alter the PutHDFS
> processor to do that. It is probably best to implement the flow using
> cohesive processors (grab files, decompress files, mask files, write
> to HDFS). Given how the repository construct in NiFi works, and given
> how caching in Linux works, it is very possible you'll be quite
> surprised by the throughput you see. Even then, you can optimize once
> you're sure you need to. The other thing to keep in mind here is that
> often a flow that starts out as specific as this turns into a great
> place to tap the stream of data to feed some new system or new
> algorithm with a different format or protocol. At that moment the
> benefits become even more obvious.
>
> Regarding the Flume processors in NiFi and their memory usage: NiFi
> offers a nice hosting mechanism for Flume processes and brings some
> of the benefits of NiFi's UI, provenance, and repository concept.
> However, we're still largely limited to the design assumptions one
> gets when building a Flume process, and those can be quite memory
> limiting. We see what we have today as a great way to help people
> transition their existing Flume flows into NiFi by leveraging their
> existing code, but we would recommend working to phase those out over
> time so that you can take full benefit of what NiFi brings over
> Flume.
>
> Thanks
> Joe
>
>
> On Fri, Jan 1, 2016 at 4:18 AM, obaidul karim <obaidc...@gmail.com> wrote:
> > Hi,
> >
> > I am new to NiFi and am exploring it as an open source ETL tool.
> >
> > As I understand it, flow files are stored on local disk and contain
> > the actual data.
> > If that is true, consider the scenarios below:
> >
> > Scenario 1:
> > - In a spool directory we have terabytes (5-6 TB/day) of files
> > coming from external sources
> > - I want to push those files to HDFS as-is, without any changes
> >
> > Scenario 2:
> > - In a spool directory we have terabytes (5-6 TB/day) of files
> > coming from external sources
> > - I want to mask some of the sensitive columns
> > - Then send one copy to HDFS and another copy to Kafka
> >
> > Questions for Scenario 1:
> > 1.a In that case those 5-6 TB of data will be written to local disk
> > again as flow files, causing double I/O, which may eventually hurt
> > performance due to an I/O bottleneck.
> > Is there any way to bypass writing flow files to disk, or to pass
> > those files directly to HDFS as they are?
> > 1.b If the files in the spool directory are compressed (zip/gzip),
> > can we store the files on HDFS uncompressed?
> >
> > Questions for Scenario 2:
> > 2.a Can we use our existing Java code for masking? If yes, then how?
> > 2.b For this scenario we also want to bypass storing flow files on
> > disk. Can we do it on the fly, masking and storing on HDFS?
> > 2.c If the source files are compressed (zip/gzip), is there any
> > issue with masking here?
> >
> >
> > In fact, I tried the above using Flume plus Flume interceptors.
> > Everything worked fine with smaller files, but when source files
> > grow larger than 50 MB, Flume chokes :(
> > So I am exploring options in NiFi. I hope to get some guidance from
> > you guys.
> >
> >
> > Thanks in advance.
> > -Obaid