Hello Obaid,

At 6 TB/day and an average size of 2-3 GB per dataset you're looking
at a sustained rate of 70+ MB/s and a pretty low transaction rate, so
that is well within a good range to work with on a single system.
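
  For reference, the rough arithmetic behind that number (assuming the
data arrives fairly evenly over the day):

    6 TB/day ≈ 6,000,000 MB / 86,400 s ≈ 69 MB/s sustained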

"Is there any way to bypass writing flow files on disk or directly
pass those files to HDFS as-is?"

  By design, there is no way to bypass NiFi taking a copy of that
data.  NiFi is helping you formulate a graph of dataflow requirements
from given sources through given processing steps, ultimately driving
data into given destination systems.  As a result it takes on the
challenge of handling the transactionality of each interaction and
the buffering and backpressure needed to deal with the realities of
different production/consumption patterns.

"If the files on the spool directory are compressed(zip/gzip), can we
store files on HDFS as uncompressed ?"

  Certainly.  Both of those formats (zip/gzip) are supported in NiFi
out of the box.  You simply run the data through the appropriate
processor prior to the PutHDFS processor to unpack (zip) or
decompress (gzip) as needed.
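
  As a rough sketch of what that flow could look like (GetFile,
UnpackContent, CompressContent, and PutHDFS are all standard
processors; exact settings such as file filters and HDFS paths are up
to you):

    GetFile (spool dir) -> UnpackContent -> PutHDFS                  [zip]
    GetFile (spool dir) -> CompressContent (decompress) -> PutHDFS   [gzip]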

"2.a Can we use our existing java code for masking ? if yes then how ?
2.b For this Scenario we also want to bypass storing flow files on
disk. Can we do it on the fly, masking and storing on HDFS ?
2.c If the source files are compressed (zip/gzip), is there any issue
for masking here ?"

  You would build a custom NiFi processor that leverages your
existing code.  If your code is able to operate on an InputStream and
write to an OutputStream then it is very likely you'll be able to
handle arbitrarily large objects with no negative impact on the JVM
heap either.  This is thanks to the fact that the data lives in
NiFi's content repository with copy-on-write/pass-by-reference
semantics and that the API exposes those streams to your code in a
transactional manner.
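
  As a minimal sketch, assuming your existing code exposes something
like a hypothetical MaskingUtil.mask(InputStream, OutputStream) method
(properties, error handling, and a failure relationship are omitted
for brevity):

package com.example.nifi;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.io.StreamCallback;

public class MaskContent extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Masked content")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context,
                          final ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // The framework hands your code the content as streams; it is
        // never loaded fully into the heap, so multi-GB files are fine.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(final InputStream in,
                                final OutputStream out) throws IOException {
                // Hypothetical call into your existing masking library.
                MaskingUtil.mask(in, out);
            }
        });

        session.transfer(flowFile, REL_SUCCESS);
    }
}

Packaged as a NAR and dropped into NiFi's lib directory, it shows up
in the UI alongside the standard processors.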

  If you want the step that writes to HDFS to also do decompression
and masking in one pass you'll need to extend/alter the PutHDFS
processor to do that.  It is probably best to implement the flow
using cohesive processors (grab files, decompress files, mask files,
write to HDFS).  Given how the repository construct in NiFi works,
and given how caching in Linux works, it is very possible you'll be
pleasantly surprised by the throughput you see.  Even then, you can
optimize once you're sure you need to.  The other thing to keep in
mind is that a flow that starts out as specific as this often turns
into a great place to tap the stream of data to feed some new system
or new algorithm with a different format or protocol; at that moment
the benefits become even more obvious.
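
  In flow terms, that cohesive version of scenario 2 might look like
the sketch below (MaskContent is the hypothetical custom processor
from the example above; PutKafka handles the Kafka copy, and the same
"success" relationship can be routed to more than one downstream
processor):

    GetFile -> CompressContent (decompress) -> MaskContent
        MaskContent "success" -> PutHDFS
        MaskContent "success" -> PutKafka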

Regarding the Flume processors in NiFi and their memory usage: NiFi
offers a nice hosting mechanism for Flume sources and sinks and
brings some of the benefits of NiFi's UI, provenance, and repository
concept.  However, we're still largely limited to the design
assumptions one inherits when building a Flume component, and those
can be quite memory limiting.  We see what we have today as a great
way to help people transition their existing Flume flows into NiFi by
leveraging their existing code, but we would recommend working to
phase those out over time so that you can take full advantage of what
NiFi brings over Flume.

Thanks
Joe


On Fri, Jan 1, 2016 at 4:18 AM, obaidul karim <obaidc...@gmail.com> wrote:
> Hi,
>
> I am new to NiFi and exploring it as an open source ETL tool.
>
> As per my understanding, flow files are stored on local disk and contain the
> actual data.
> If the above is true, let's consider the scenarios below:
>
> Scenario 1:
> - In a spool directory we have terabytes (5-6 TB/day) of files coming from
> external sources
> - I want to push those files to HDFS as-is without any changes
>
> Scenario 2:
> - In a spool directory we have terabytes (5-6 TB/day) of files coming from
> external sources
> - I want to mask some of the sensitive columns
> - Then send one copy to HDFS and another copy to Kafka
>
> Question for Scenario 1:
> 1.a In that case those 5-6 TB of data will be written again on local disk as
> flow files and will cause double I/O, which may eventually cause slower
> performance due to an I/O bottleneck.
> Is there any way to bypass writing flow files on disk or directly pass
> those files to HDFS as-is?
> 1.b If the files on the spool directory are compressed (zip/gzip), can we
> store files on HDFS as uncompressed?
>
> Question for Scenario 2:
> 2.a Can we use our existing Java code for masking? If yes, then how?
> 2.b For this scenario we also want to bypass storing flow files on disk. Can
> we do it on the fly, masking and storing on HDFS?
> 2.c If the source files are compressed (zip/gzip), is there any issue for
> masking here?
>
>
> In fact, I tried the above using Flume + Flume interceptors. Everything works
> fine with smaller files, but when source files are greater than 50 MB Flume
> chokes :(.
> So I am exploring options in NiFi. Hope I will get some guidance from you
> guys.
>
>
> Thanks in advance.
> -Obaid
