Hi,

I am new to NiFi and exploring it as an open source ETL tool.

As per my understanding, flow files are stored on the local disk and contain
the actual data.
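If I read the documentation correctly, the on-disk locations come from
properties like these in conf/nifi.properties (I believe these are the
defaults; please correct me if I am wrong):

  nifi.flowfile.repository.directory=./flowfile_repository
  nifi.content.repository.directory.default=./content_repository
  nifi.provenance.repository.directory.default=./provenance_repository
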
If my understanding is correct, let's consider the two scenarios below:

Scenario 1:
- In a spool directory we have terabytes (5-6 TB/day) of files coming from
external sources
- I want to push those files to HDFS as-is, without any changes
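
In NiFi terms, I am imagining a flow roughly like this (the processor names
are just my guess from the documentation):

  GetFile (spool directory) -> PutHDFS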

Scenario 2:
- In a spool directory we have terabytes (5-6 TB/day) of files coming from
external sources
- I want to mask some of the sensitive columns
- Then send one copy to HDFS and another copy to Kafka
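
Here I am imagining something roughly like this (again, just my guess at the
processors, and I do not know where our Java code would fit):

  GetFile -> masking step (ExecuteScript or a custom processor)
          -> PutHDFS and PublishKafka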

Question for Scenario 1:
1.a In that case, those 5-6 TB of data will again be written to the local
disk as flow files, causing double I/O, which may eventually degrade
performance due to an I/O bottleneck.
Is there any way to bypass writing flow files to disk, or to pass those
files directly to HDFS as-is?
1.b If the files in the spool directory are compressed (zip/gzip), can we
store them on HDFS uncompressed?

Question for Scenario 2:
2.a Can we use our existing Java code for masking? If yes, then how? (Our
masking logic is essentially a simple method; see the simplified sketch
after this list.)
2.b For this scenario we also want to bypass storing flow files on disk.
Can we do the masking on the fly and store the results directly on HDFS?
2.c If the source files are compressed (zip/gzip), is there any issue with
masking here?
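
To give an idea of the kind of Java code we would like to reuse, here is a
simplified sketch of our masking logic (the delimiter, column indices and
masking rule are placeholders, not our real ones):

  import java.util.regex.Pattern;

  public class ColumnMasker {

      // Placeholder 0-based indices of the sensitive columns
      private static final int[] SENSITIVE_COLUMNS = {2, 5};

      // Mask the sensitive columns of a single delimited record
      public static String maskLine(String line, String delimiter) {
          String[] fields = line.split(Pattern.quote(delimiter), -1);
          for (int col : SENSITIVE_COLUMNS) {
              if (col < fields.length) {
                  // Placeholder rule: replace every character with '*'
                  fields[col] = fields[col].replaceAll(".", "*");
              }
          }
          return String.join(delimiter, fields);
      }
  }

We currently call this line by line while streaming through each file.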


In fact, I tried the above using Flume with Flume interceptors. Everything
works fine with smaller files, but when the source files are larger than
50 MB, Flume chokes :(.
So I am exploring options in NiFi. I hope to get some guidance from you
guys.


Thanks in advance.
-Obaid
