Hi Joe,

Please find attached the jstat & iostat output. So far it seems to me that it is CPU bound. However, your eyes are better than mine :).

-Obaid

On Thu, Jan 14, 2016 at 11:51 AM, Joe Witt <joe.w...@gmail.com> wrote:
> Hello
>
> Let's narrow in on potential issues. While this process is running and
> appears sluggish, please run the following on the command line:
>
> 'jps'
>
> This command will tell you the process id of NiFi. You'll want the pid
> associated with the Java process other than the one called 'jps',
> presuming nothing other than NiFi is running at the time.
>
> Let's say the result is a pid of '12345'. Then run this command:
>
> 'jstat -gcutil 12345 1000'
>
> This will generate garbage collection information every second until you
> stop it with Ctrl-C. Let that run for a while, say 30 seconds or so, then
> hit Ctrl-C. Can you please paste that output in response? That will show
> us the general health of GC.
>
> Another really important/powerful set of output can be gleaned by running
> 'iostat', which gives you statistics about input/output to things like
> the underlying storage system. It is part of the 'sysstat' package, in
> case you need to install it. Then you can run
>
> 'iostat -xmh 1'
>
> or something even as simple as 'iostat 1'. Your specific command string
> may vary. Please let that run for say 10-20 seconds and paste those
> results as well. That will give a sense of IO utilization while the
> operation is running.
>
> Between these two outputs (garbage collection / IO) we should have a
> pretty good idea of where to focus the effort to find why it is slow.
>
> Thanks
> Joe
>
> On Wed, Jan 13, 2016 at 9:23 PM, obaidul karim <obaidc...@gmail.com> wrote:
> > Hi Joe & Others,
> >
> > Thanks for all of your suggestions.
> >
> > Now I am using the below code:
> > 1. Buffered reader (I tried to use NLKBufferedReader, but it required
> >    too many libs & NiFi failed to start. I was lost.)
> > 2. Buffered writer
> > 3.
> > Appending the line ending instead of concatenating a newline
> >
> > Still no performance gain. Am I doing something wrong? Is there
> > anything else I can change here?
> >
> > flowfile = session.write(flowfile, new StreamCallback() {
> >     @Override
> >     public void process(InputStream in, OutputStream out) throws IOException {
> >         try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset), maxBufferSize);
> >              BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, charset))) {
> >
> >             if (skipHeader == true && headerExists == true) { // to skip the header, do an additional line fetch before going to the next step
> >                 if (reader.ready()) reader.readLine();
> >             } else if (skipHeader == false && headerExists == true) { // if the header is not skipped then there is no need to mask, just pass through
> >                 if (reader.ready()) {
> >                     writer.write(reader.readLine());
> >                     writer.write(lineEndingBuilder.toString());
> >                 }
> >             }
> >             // decide about empty line earlier
> >             String line;
> >             while ((line = reader.readLine()) != null) {
> >                 writer.write(parseLine(line, seperator, quote, escape, maskColumns));
> >                 writer.write(lineEndingBuilder.toString());
> >             }
> >             writer.flush();
> >         }
> >     }
> > });
> >
> > -Obaid
> >
> > On Wed, Jan 13, 2016 at 1:38 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >>
> >> Hello
> >>
> >> So the performance went from what sounded pretty good to what sounds
> >> pretty problematic. The rate now sounds like it is around 5MB/s, which
> >> is indeed quite poor. Building on what Bryan said, there appear to be
> >> some good opportunities to improve the performance. The link he
> >> provided, expanded to cover the full range to look at, is here [1].
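[Editor's note: Obaid's callback boils down to the pattern below. This is a self-contained sketch only: parseLine here is a placeholder mask (blanking one column), not the real FPE logic, and the NiFi session/flow-file wiring is omitted.]

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MaskSketch {
    static final Charset CHARSET = StandardCharsets.UTF_8;

    // Placeholder mask: blank out the second column of a comma-separated
    // line. The real processor calls parseLine(line, seperator, quote,
    // escape, maskColumns) with FPE-based masking instead.
    static String parseLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length > 1) cols[1] = "****";
        return String.join(",", cols);
    }

    // Same shape as the StreamCallback#process body: buffer both sides,
    // write Strings through one charset-aware writer, and write the line
    // ending separately instead of concatenating it onto each line.
    static void process(InputStream in, OutputStream out) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, CHARSET));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, CHARSET))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(parseLine(line));
                writer.write('\n');
            }
            writer.flush();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "1,4111111111111111,9.99\n2,4222222222222222,1.50\n".getBytes(CHARSET);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        process(new ByteArrayInputStream(input), bos);
        System.out.print(bos.toString("UTF-8"));
    }
}
```

Note that readLine() strips the original line ending, so a flow that must preserve \r\n exactly needs a newline-keeping reader like the one ReplaceText uses [1]; this sketch simply writes '\n'.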
> >> Couple key points to note:
> >> 1) Use a buffered line-oriented reader that preserves the newlines
> >> 2) Write to a buffered writer that accepts strings and understands
> >>    which charset you intend to write out
> >> 3) Avoid string concat with the newline
> >>
> >> Also keep in mind how large any single line could be, because if lines
> >> can be quite large you may need to consider the GC pressure that can
> >> be caused. But let's take a look at how things are after these easier
> >> steps first.
> >>
> >> [1] https://github.com/apache/nifi/blob/ee14d8f9dd0c3f18920d910fcddd6d79b8b9f9cf/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ReplaceText.java#L334-L361
> >>
> >> Thanks
> >> Joe
> >>
> >> On Tue, Jan 12, 2016 at 10:35 PM, Juan Sequeiros <helloj...@gmail.com> wrote:
> >> > Obaid,
> >> >
> >> > Since you mention that you will have dedicated ETL servers, and
> >> > assume they will also have a decent amount of RAM on them, I would
> >> > not shy away from increasing your threads.
> >> >
> >> > Also, in your staging directory, if you do not need to keep the
> >> > originals, you might consider GetFile, and on that one use one
> >> > thread.
> >> >
> >> > Hi Joe,
> >> >
> >> > Yes, I took into consideration the existing RAID and HW settings. We
> >> > have a 10G NIC for all Hadoop intra-connectivity, and the server in
> >> > question is an edge node of our Hadoop cluster.
> >> > In the production scenario we will use dedicated ETL servers having
> >> > high-performance (>500MB/s) local disks.
> >> >
> >> > Sharing good news: I have successfully masked & loaded to HDFS 110
> >> > GB of data using the below flow:
> >> >
> >> > ExecuteProcess(touch and mv to input dir) > ListFile (1 thread) >
> >> > FetchFile (1 thread) > maskColumn (4 threads) > PutHDFS (1 thread).
> >> > * used 4 threads for masking and 1 for the others because I found
> >> >   masking to be the slowest component.
> >> >
> >> > However, it seems to be too slow. It was processing 2GB files in 6
> >> > minutes. It may be because of my masking algorithm (although the
> >> > masking algorithm is a pretty simple FPE with a simple twist).
> >> > However, I want to be sure that the way I have written the custom
> >> > processor is the most efficient way. Please see the code chunk below
> >> > and let me know whether it is the fastest way to process flowfiles
> >> > (csv source files) which need modifications on specific columns:
> >> >
> >> > * the parseLine method contains the logic for masking.
> >> >
> >> > flowfile = session.write(flowfile, new StreamCallback() {
> >> >     @Override
> >> >     public void process(InputStream in, OutputStream out) throws IOException {
> >> >
> >> >         BufferedReader reader = new BufferedReader(new InputStreamReader(in));
> >> >         String line;
> >> >         if (skipHeader == true && headerExists == true) { // to skip the header, do an additional line fetch before going to the next step
> >> >             if (reader.ready()) reader.readLine();
> >> >         } else if (skipHeader == false && headerExists == true) { // if the header is not skipped then there is no need to mask, just pass through
> >> >             if (reader.ready())
> >> >                 out.write((reader.readLine() + "\n").getBytes());
> >> >         }
> >> >
> >> >         // decide about empty line earlier
> >> >         while ((line = reader.readLine()) != null) {
> >> >             if (line.trim().length() > 0) {
> >> >                 out.write(parseLine(line, seperator, quote, escape, maskColumns).getBytes());
> >> >             }
> >> >         }
> >> >         out.flush();
> >> >     }
> >> > });
> >> >
> >> > Thanks in advance.
> >> > -Obaid
> >> >
> >> > On Tue, Jan 5, 2016 at 12:36 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >> >>
> >> >> Obaid,
> >> >>
> >> >> Really happy you're seeing the performance you need.
> >> >> That works out to about 110MB/s on average over that period. Any
> >> >> chance you have a 1Gb NIC? If you really want to have fun with
> >> >> performance tuning you can use things like iostat and other commands
> >> >> to observe disk, network, and CPU. Something else to consider is the
> >> >> potential throughput gain of multiple RAID-1 containers rather than
> >> >> RAID-5, since NiFi can use both in parallel. Depends on your
> >> >> goals/workload, so just an FYI.
> >> >>
> >> >> A good reference for how to build a processor which does altering
> >> >> of the data (transformation) is here [1]. It is a good idea to do a
> >> >> quick read through that document. Also, one of the great things you
> >> >> can do is look at existing processors. Some good examples relevant
> >> >> to transformation are [2], [3], and [4], which are quite simple
> >> >> stream transform types. Or take a look at [5], which is a more
> >> >> complicated example. You might also be excited to know that there
> >> >> is some really cool work done to bring various languages into NiFi,
> >> >> which looks on track to be available in the upcoming 0.5.0 release:
> >> >> NIFI-210 [6]. That will provide a really great option to quickly
> >> >> build transforms using languages like Groovy, JRuby, JavaScript,
> >> >> Scala, Lua, and Jython.
> >> >> > >> >> [1] > >> >> > >> >> > https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#enrich-modify-content > >> >> > >> >> [2] > >> >> > >> >> > https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/Base64EncodeContent.java > >> >> > >> >> [3] > >> >> > >> >> > https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TransformXml.java > >> >> > >> >> [4] > >> >> > >> >> > https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ModifyBytes.java > >> >> > >> >> [5] > >> >> > >> >> > https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ReplaceText.java > >> >> > >> >> [6] https://issues.apache.org/jira/browse/NIFI-210 > >> >> > >> >> Thanks > >> >> Joe > >> >> > >> >> On Mon, Jan 4, 2016 at 9:32 PM, obaidul karim <obaidc...@gmail.com> > >> >> wrote: > >> >> > Hi Joe, > >> >> > > >> >> > Just completed by test with 100GB data (on a local RAID 5 disk on a > >> >> > single > >> >> > server). > >> >> > > >> >> > I was able to load 100GB data within 15 minutes(awesome!!) using > >> >> > below > >> >> > flow. > >> >> > This throughput is enough to load 10TB data in a day with a single > >> >> > and > >> >> > simple machine. > >> >> > During the test, server disk I/O went up to 200MB/s. > >> >> > > >> >> > ExecuteProcess(touch and mv to input dir) > ListFile > > FetchFile > >> >> > (4 > >> >> > threads) > PutHDFS (4 threads) > >> >> > > >> >> > My Next action is to incorporate my java code for column masking > with > >> >> > a > >> >> > custom processor. > >> >> > I am now exploring on that. 
> >> >> > However, if you have any good reference on custom processors
> >> >> > (altering actual data) please let me know.
> >> >> >
> >> >> > Thanks,
> >> >> > Obaid
> >> >> >
> >> >> > On Mon, Jan 4, 2016 at 9:11 AM, obaidul karim <obaidc...@gmail.com> wrote:
> >> >> >>
> >> >> >> Hi Joe,
> >> >> >>
> >> >> >> Yes, a symlink is another option I was considering when I was
> >> >> >> trying to use GetFile.
> >> >> >> Thanks for your insights. I will update you on this mail chain
> >> >> >> when my entire workflow completes, so that this could be a
> >> >> >> reference for others :).
> >> >> >>
> >> >> >> -Obaid
> >> >> >>
> >> >> >> On Monday, January 4, 2016, Joe Witt <joe.w...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> Obaid,
> >> >> >>>
> >> >> >>> You make a great point.
> >> >> >>>
> >> >> >>> I agree we will ultimately need to do more to make that very
> >> >> >>> valid approach work easily. The downside is that it puts the
> >> >> >>> onus on NiFi to keep track of a variety of potentially quite
> >> >> >>> large state about the directory. One way to avoid that expense
> >> >> >>> is if NiFi can pull a copy of, and then delete, the source file.
> >> >> >>> If you'd like to keep a copy around, I wonder if a good approach
> >> >> >>> is to simply create a symlink to the original file you want NiFi
> >> >> >>> to pull, but have the symlink in the NiFi pickup directory. NiFi
> >> >> >>> is then free to read and delete, which means it simply pulls
> >> >> >>> whatever shows up in that directory and doesn't have to keep
> >> >> >>> state about filenames and checksums.
> >> >> >>>
> >> >> >>> I realize we still need to do what you're suggesting as well but
> >> >> >>> thought I'd run this by you.
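[Editor's note: a minimal sketch of that symlink hand-off, using temporary directories as hypothetical stand-ins for the real source and NiFi pickup paths.]

```shell
#!/bin/sh
# src_dir holds the originals we want to keep; pickup_dir stands in for
# the directory NiFi's GetFile watches (with "Keep Source File" = false).
src_dir=$(mktemp -d)
pickup_dir=$(mktemp -d)

printf 'a,b,c\n' > "$src_dir/data1.csv"

# Hand the file to NiFi as a symlink in the pickup directory.
ln -s "$src_dir/data1.csv" "$pickup_dir/data1.csv"

# NiFi reads through the link, then deletes it. Deleting the entry in the
# pickup directory removes only the link, so the original stays in place.
rm "$pickup_dir/data1.csv"
ls -l "$src_dir"
```

This is the property Joe describes: NiFi can "read and delete" freely, while the untouched originals remain in the source directory.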
> >> >> >>> > >> >> >>> Joe > >> >> >>> > >> >> >>> On Sun, Jan 3, 2016 at 6:43 PM, obaidul karim < > obaidc...@gmail.com> > >> >> >>> wrote: > >> >> >>> > Hi Joe, > >> >> >>> > > >> >> >>> > Condider a scenerio, where we need to feed some older files and > >> >> >>> > we > >> >> >>> > are > >> >> >>> > using > >> >> >>> > "mv" to feed files to input directory( to reduce IO we may use > >> >> >>> > "mv"). > >> >> >>> > If we > >> >> >>> > use "mv", last modified date will not changed. And this is very > >> >> >>> > common > >> >> >>> > on a > >> >> >>> > busy file collection system. > >> >> >>> > > >> >> >>> > However, I think I can still manage it by adding additional > >> >> >>> > "touch" > >> >> >>> > before > >> >> >>> > moving fole in the target directory. > >> >> >>> > > >> >> >>> > So, my suggestion is to add file selection criteria as an > >> >> >>> > configurable > >> >> >>> > option in listfile process on workflow. Options could be last > >> >> >>> > modified > >> >> >>> > date(as current one) unique file names, checksum etc. > >> >> >>> > > >> >> >>> > Thanks again man. > >> >> >>> > -Obaid > >> >> >>> > > >> >> >>> > > >> >> >>> > On Monday, January 4, 2016, Joe Witt <joe.w...@gmail.com> > wrote: > >> >> >>> >> > >> >> >>> >> Hello Obaid, > >> >> >>> >> > >> >> >>> >> The default behavior of the ListFile processor is to keep > track > >> >> >>> >> of > >> >> >>> >> the > >> >> >>> >> last modified time of the files it lists. When you changed > the > >> >> >>> >> name > >> >> >>> >> of the file that doesn't change the last modified time as > >> >> >>> >> tracked > >> >> >>> >> by > >> >> >>> >> the OS but when you altered content it does. Simply 'touch' > on > >> >> >>> >> the > >> >> >>> >> file would do it too. > >> >> >>> >> > >> >> >>> >> I believe we could observe the last modified time of the > >> >> >>> >> directory > >> >> >>> >> in > >> >> >>> >> which the file lives to detect something like a rename. 
> >> >> >>> >> However, > >> >> >>> >> we'd > >> >> >>> >> not know which file was renamed just that something was > changed. > >> >> >>> >> So > >> >> >>> >> it require keeping some potentially problematic state to > >> >> >>> >> deconflict > >> >> >>> >> or > >> >> >>> >> requiring the user to have a duplicate detection process > >> >> >>> >> afterwards. > >> >> >>> >> > >> >> >>> >> So with that in mind is the current behavior sufficient for > your > >> >> >>> >> case? > >> >> >>> >> > >> >> >>> >> Thanks > >> >> >>> >> Joe > >> >> >>> >> > >> >> >>> >> On Sun, Jan 3, 2016 at 6:17 AM, obaidul karim > >> >> >>> >> <obaidc...@gmail.com> > >> >> >>> >> wrote: > >> >> >>> >> > Hi Joe, > >> >> >>> >> > > >> >> >>> >> > I am now exploring your solution. > >> >> >>> >> > Starting with below flow: > >> >> >>> >> > > >> >> >>> >> > ListFIle > FetchFile > CompressContent > PutFile. > >> >> >>> >> > > >> >> >>> >> > Seems all fine. Except some confusion with how ListFile > >> >> >>> >> > identifies > >> >> >>> >> > new > >> >> >>> >> > files. > >> >> >>> >> > In order to test, I renamed a already processed file and put > >> >> >>> >> > in > >> >> >>> >> > in > >> >> >>> >> > input > >> >> >>> >> > folder and found that the file is not processing. > >> >> >>> >> > Then I randomly changed the content of the file and it was > >> >> >>> >> > immediately > >> >> >>> >> > processed. > >> >> >>> >> > > >> >> >>> >> > My question is what is the new file selection criteria for > >> >> >>> >> > "ListFile" ? > >> >> >>> >> > Can > >> >> >>> >> > I change it only to file name ? > >> >> >>> >> > > >> >> >>> >> > Thanks in advance. 
> >> >> >>> >> > > >> >> >>> >> > -Obaid > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> > > >> >> >>> >> > On Fri, Jan 1, 2016 at 10:43 PM, Joe Witt < > joe.w...@gmail.com> > >> >> >>> >> > wrote: > >> >> >>> >> >> > >> >> >>> >> >> Hello Obaid, > >> >> >>> >> >> > >> >> >>> >> >> At 6 TB/day and average size of 2-3GB per dataset you're > >> >> >>> >> >> looking > >> >> >>> >> >> at > >> >> >>> >> >> a > >> >> >>> >> >> sustained rate of 70+MB/s and a pretty low transaction > rate. > >> >> >>> >> >> So > >> >> >>> >> >> well > >> >> >>> >> >> within a good range to work with on a single system. > >> >> >>> >> >> > >> >> >>> >> >> 'I's there any way to by pass writing flow files on disk or > >> >> >>> >> >> directly > >> >> >>> >> >> pass those files to HDFS as it is ?" > >> >> >>> >> >> > >> >> >>> >> >> There is no way to bypass NiFi taking a copy of that data > >> >> >>> >> >> by > >> >> >>> >> >> design. > >> >> >>> >> >> NiFi is helping you formulate a graph of dataflow > >> >> >>> >> >> requirements > >> >> >>> >> >> from > >> >> >>> >> >> a > >> >> >>> >> >> given source(s) through given processing steps and ultimate > >> >> >>> >> >> driving > >> >> >>> >> >> data into given destination systems. As a result it takes > on > >> >> >>> >> >> the > >> >> >>> >> >> challenge of handling transactionality of each interaction > >> >> >>> >> >> and > >> >> >>> >> >> the > >> >> >>> >> >> buffering and backpressure to deal with the realities of > >> >> >>> >> >> different > >> >> >>> >> >> production/consumption patterns. > >> >> >>> >> >> > >> >> >>> >> >> "If the files on the spool directory are > >> >> >>> >> >> compressed(zip/gzip), > >> >> >>> >> >> can > >> >> >>> >> >> we > >> >> >>> >> >> store files on HDFS as uncompressed ?" > >> >> >>> >> >> > >> >> >>> >> >> Certainly. 
> >> >>> >> >> Both of those formats (zip/gzip) are supported in NiFi out
> >> >>> >> >> of the box. You simply run the data through the proper
> >> >>> >> >> processor prior to the PutHDFS processor to unpack (zip) or
> >> >>> >> >> decompress (gzip) as needed.
> >> >>> >> >>
> >> >>> >> >> "2.a Can we use our existing Java code for masking? If yes,
> >> >>> >> >> then how?
> >> >>> >> >> 2.b For this scenario we also want to bypass storing flow
> >> >>> >> >> files on disk. Can we do it on the fly, masking and storing
> >> >>> >> >> on HDFS?
> >> >>> >> >> 2.c If the source files are compressed (zip/gzip), is there
> >> >>> >> >> any issue for masking here?"
> >> >>> >> >>
> >> >>> >> >> You would build a custom NiFi processor that leverages your
> >> >>> >> >> existing code. If your code is able to operate on an
> >> >>> >> >> InputStream and write to an OutputStream, then it is very
> >> >>> >> >> likely you'll be able to handle arbitrarily large objects
> >> >>> >> >> with zero negative impact to the JVM heap as well. This is
> >> >>> >> >> thanks to the fact that the data is present in NiFi's
> >> >>> >> >> repository with copy-on-write/pass-by-reference semantics
> >> >>> >> >> and that the API is exposing those streams to your code in a
> >> >>> >> >> transactional manner.
> >> >>> >> >>
> >> >>> >> >> If you want the process of writing to HDFS to also do
> >> >>> >> >> decompression and masking in one pass, you'll need to
> >> >>> >> >> extend/alter the PutHDFS processor to do that.
> >> >>> >> >> It is probably best to implement the flow using cohesive
> >> >>> >> >> processors (grab files, decompress files, mask files, write
> >> >>> >> >> to HDFS). Given how the repository construct in NiFi works,
> >> >>> >> >> and given how caching in Linux works, it is very possible
> >> >>> >> >> you'll be quite surprised by the throughput you'll see. Even
> >> >>> >> >> then, you can optimize once you're sure you need to. The
> >> >>> >> >> other thing to keep in mind here is that often a flow that
> >> >>> >> >> starts out as specific as this turns into a great place to
> >> >>> >> >> tap the stream of data to feed some new system or new
> >> >>> >> >> algorithm with a different format or protocol. At that
> >> >>> >> >> moment the benefits become even more obvious.
> >> >>> >> >>
> >> >>> >> >> Regarding the Flume processors in NiFi and their memory
> >> >>> >> >> usage: NiFi offers a nice hosting mechanism for the Flume
> >> >>> >> >> processes and brings some of the benefits of NiFi's UI,
> >> >>> >> >> provenance, and repository concept. However, we're still
> >> >>> >> >> largely limited to the design assumptions one gets when
> >> >>> >> >> building a Flume process, and that can be quite memory
> >> >>> >> >> limiting.
> >> >>> >> >> We see what we have today as a great way to help people
> >> >>> >> >> transition their existing Flume flows into NiFi by
> >> >>> >> >> leveraging their existing code, but would recommend working
> >> >>> >> >> to phase the use of those out in time so that you can take
> >> >>> >> >> full benefit of what NiFi brings over Flume.
> >> >>> >> >>
> >> >>> >> >> Thanks
> >> >>> >> >> Joe
> >> >>> >> >>
> >> >>> >> >> On Fri, Jan 1, 2016 at 4:18 AM, obaidul karim <obaidc...@gmail.com> wrote:
> >> >>> >> >> > Hi,
> >> >>> >> >> >
> >> >>> >> >> > I am new to NiFi and exploring it as an open source ETL
> >> >>> >> >> > tool.
> >> >>> >> >> >
> >> >>> >> >> > As per my understanding, flow files are stored on local
> >> >>> >> >> > disk and contain the actual data.
> >> >>> >> >> > If the above is true, let's consider the scenarios below:
> >> >>> >> >> >
> >> >>> >> >> > Scenario 1:
> >> >>> >> >> > - In a spool directory we have terabytes (5-6TB/day) of
> >> >>> >> >> >   files coming from external sources
> >> >>> >> >> > - I want to push those files to HDFS as is, without any
> >> >>> >> >> >   changes
> >> >>> >> >> >
> >> >>> >> >> > Scenario 2:
> >> >>> >> >> > - In a spool directory we have terabytes (5-6TB/day) of
> >> >>> >> >> >   files coming from external sources
> >> >>> >> >> > - I want to mask some of the sensitive columns
> >> >>> >> >> > - Then send one copy to HDFS and another copy to Kafka
> >> >>> >> >> >
> >> >>> >> >> > Questions for Scenario 1:
> >> >>> >> >> > 1.a In that case those 5-6TB of data will again be written
> >> >>> >> >> > to local disk as flow files and will cause double I/O,
> >> >>> >> >> > which eventually may cause slower performance due to an
> >> >>> >> >> > I/O bottleneck.
> >> >>> >> >> > Is there any way to bypass writing flow files on disk, or
> >> >>> >> >> > to directly pass those files to HDFS as is?
> >> >>> >> >> > 1.b If the files in the spool directory are compressed
> >> >>> >> >> > (zip/gzip), can we store the files on HDFS as uncompressed?
> >> >>> >> >> >
> >> >>> >> >> > Questions for Scenario 2:
> >> >>> >> >> > 2.a Can we use our existing Java code for masking? If yes,
> >> >>> >> >> > then how?
> >> >>> >> >> > 2.b For this scenario we also want to bypass storing flow
> >> >>> >> >> > files on disk.
> >> >>> >> >> > Can we do it on the fly, masking and storing on HDFS?
> >> >>> >> >> > 2.c If the source files are compressed (zip/gzip), is
> >> >>> >> >> > there any issue for masking here?
> >> >>> >> >> >
> >> >>> >> >> > In fact, I tried the above using Flume + Flume
> >> >>> >> >> > interceptors. Everything worked fine with smaller files,
> >> >>> >> >> > but when source files are greater than 50MB, Flume
> >> >>> >> >> > chokes :(.
> >> >>> >> >> > So I am exploring options in NiFi. Hope I will get some
> >> >>> >> >> > guidance from you guys.
> >> >>> >> >> >
> >> >>> >> >> > Thanks in advance.
> >> >>> >> >> > -Obaid
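[Editor's note: the diagnostics Joe requested at the top of the thread (jps to find the pid, jstat for GC, iostat for disk) can be bundled into one bounded script. The NiFi process-name match and the sample counts are assumptions, and iostat's flags are written here as proper options rather than the bare "xmh" of the original command.]

```shell
#!/bin/sh
# Grab the NiFi pid from jps output ($2 is the process-name column), then
# take a fixed number of samples so the script ends without Ctrl-C.
if command -v jps >/dev/null 2>&1; then
  pid=$(jps | awk '$2 == "NiFi" {print $1; exit}')
  if [ -n "$pid" ]; then
    jstat -gcutil "$pid" 1000 30   # GC utilization, 1s interval, 30 samples
  fi
fi

if command -v iostat >/dev/null 2>&1; then
  iostat -x -m 1 10                # extended device stats in MB/s, 10 samples
fi
```

Both tools are skipped gracefully if they are not installed (iostat comes from the 'sysstat' package, jps and jstat from the JDK).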
[etl@sthdmgt1-pvt nifi]$ iostat xmh 1
Linux 2.6.32-358.el6.x86_64 (sthdmgt1-pvt.aiu.axiata)   01/14/2016   _x86_64_   (24 CPU)

avg-cpu samples, one per second (each was followed by a "Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn" header with no device rows, likely because "xmh" was taken as a device name rather than as the options "-xmh"):

 %user   %nice  %system  %iowait  %steal   %idle
  6.52    0.00     0.81     0.04    0.00   92.63
 32.66    0.00     1.64     1.01    0.00   64.69
 25.08    0.00     2.31     8.32    0.00   64.29
 38.27    0.00     3.97     2.87    0.00   54.89
 45.73    0.00     4.31     0.55    0.00   49.41
 41.95    0.00     3.50     0.55    0.00   54.01
 46.30    0.00     3.97     0.25    0.00   49.47
 39.79    0.00     3.51     0.59    0.00   56.11
 28.19    0.00     3.38     8.35    0.00   60.08
 16.61    0.00     3.25    14.07    0.00   66.06
 29.01    0.00     3.25     7.77    0.00   59.97
 30.85    0.00     2.61     4.79    0.00   61.75
 34.20    0.00     1.01     0.13    0.00   64.67
 30.92    0.00     4.50     2.86    0.00   61.72
 38.09    0.00     6.55     2.74    0.00   52.62
 46.81    0.00     7.23     1.90    0.00   44.06
 23.79    0.00     3.54    13.77    0.00   58.91
 44.99    0.00     5.71     3.61    0.00   45.70
 39.31    0.00     5.56     1.73    0.00   53.41
 40.00    0.00     2.99     0.21    0.00   56.80
 35.62    0.00     2.40     0.55    0.00   61.43
 36.13    0.00     3.15     0.55    0.00   60.17
 34.37    0.00     2.69     1.14    0.00   61.80
 34.76    0.00     1.89     0.38    0.00   62.97
 33.52    0.00     1.85     0.42    0.00   64.21
 34.62    0.00     1.97     0.42    0.00   62.98
 38.78    0.00     3.51     0.38    0.00   57.32
 40.58    0.00     3.42     0.25    0.00   55.74
 38.93    0.00     2.74     0.13    0.00   58.20
 36.06    0.00     2.06     0.38    0.00   61.50
 34.61    0.00     2.23     0.55    0.00   62.62
 38.05    0.00     3.21     0.72    0.00   58.02
^C
[etl@sthdmgt1-pvt nifi]$
[etl@sthdmgt1-pvt nifi]$ jps
27744 FsShell
16016 RunJar
25618 NiFi
17218 RunJar
26423 FsShell
25578 RunNiFi
25996 FsShell
20588 RunJar
27996 FsShell
28477 Jps
[etl@sthdmgt1-pvt nifi]$
[etl@sthdmgt1-pvt nifi]$ jstat -gcutil 25618 1000
  S0     S1     E      O      M     CCS      YGC     YGCT   FGC    FGCT     GCT
  0.00   0.00  16.13  77.46  95.56  89.70  231153  881.441   425  31.966  913.407
 44.71   0.00  14.93  44.36  95.56  89.70  231203  881.655   426  32.033  913.688
  0.00  94.33   0.00  73.48  95.56  89.70  231255  881.902   426  32.033  913.934
  0.00   0.00   0.00  35.49  95.56  89.70  231306  882.111   427  32.109  914.221
 53.40   0.00   0.00  49.31  95.56  89.70  231359  882.315   427  32.109  914.424
 66.01   0.00   0.00  84.72  95.56  89.70  231414  882.529   427  32.109  914.638
 59.13   0.00   0.00  51.85  95.56  89.70  231466  882.731   428  32.176  914.907
 49.18   0.00   0.00  86.00  95.56  89.70  231520  882.951   428  32.176  915.127
  0.00  53.68   0.00  49.21  95.56  89.70  231569  883.186   429  32.248  915.434
  0.00  13.02   0.00  70.31  95.56  89.70  231625  883.398   429  32.248  915.645
  0.00  97.79   0.00  81.12  95.56  89.70  231680  883.597   429  32.248  915.845
  0.00  60.55   0.00  49.52  95.56  89.70  231732  883.807   430  32.314  916.121
  0.00  31.25   0.00  58.35  95.56  89.70  231785  884.016   430  32.314  916.330
  0.00  39.06   3.62  58.42  95.56  89.70  231836  884.248   430  32.314  916.562
 29.69   0.00   0.00  58.47  95.56  89.70  231878  884.410   430  32.314  916.724
  0.00  50.00   0.00  58.55  95.57  89.70  231909  884.535   430  32.314  916.849
 29.69   0.00  16.45  58.62  95.57  89.70  231950  884.697   430  32.314  917.011
  0.00  40.62  20.44  58.68  95.57  89.70  231974  884.790   430  32.314  917.104
 21.88   0.00   0.00  58.74  95.57  89.70  232020  884.983   430  32.314  917.298
  0.00  53.13  18.02  58.78  95.57  89.70  232055  885.111   430  32.314  917.425
  0.00  40.62  37.98  58.84  95.57  89.70  232102  885.288   430  32.314  917.602
  0.00  28.12  41.09  58.92  95.57  89.70  232151  885.465   430  32.314  917.779
  0.00  54.69  28.17  59.45  95.31  89.70  232199  885.648   430  32.314  917.962
 37.50   0.00  73.20  59.55  95.31  89.70  232249  885.833   430  32.314  918.147
  0.00  35.94  16.47  59.60  95.31  89.70  232285  885.979   430  32.314  918.293
 34.38   0.00   0.00  59.64  95.31  89.70  232338  886.197   430  32.314  918.511
  0.00  35.94  34.44  59.68  95.31  89.70  232387  886.407   430  32.314  918.721
  0.00  32.81  66.90  59.74  95.31  89.70  232438  886.608   430  32.314  918.922
 31.25   0.00  74.34  59.80  95.31  89.70  232493  886.808   430  32.314  919.122
 56.25   0.00  51.02  59.87  95.31  89.70  232546  887.004   430  32.314  919.318
 39.06   0.00   0.00  59.91  95.31  89.70  232600  887.193   430  32.314  919.507
  0.00  51.56   0.00  60.00  95.31  89.70  232653  887.385   430  32.314  919.699
  0.00  39.06  67.70  60.04  95.31  89.70  232707  887.573   430  32.314  919.887
^C
[etl@sthdmgt1-pvt nifi]$