So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty
excited about using it. I'm running HDP and need to construct an
ETL-like flow, and as a new NiFi user I'd like to start with a "best
practice" approach. Would some of you more seasoned users mind sharing
your thoughts on my problem?

1. 160 zip files/day show up on an NFS share in various
subdirectories, and their filenames contain the yyyymmddHHMMSS of when
the stats were generated.
2. Each zip file contains 4 or more large CSV files
3. I need just one of those CSVs from each zip file each day, and
together they add up to about 10 GB uncompressed
4. I need to extract that one file from each zip, strip off the first
line (the headers), and store it in HDFS re-compressed with gzip or
Snappy (roughly the shell pipeline sketched just after this list)
5. I cannot delete the NFS file after the copy to HDFS because others
need access to it for some time
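
For item 4, the per-file transformation I'm picturing is equivalent to
this shell pipeline (the paths and CSV name are made-up placeholders,
and gzip is just one of the two options):

  # pull the one CSV out of the zip, drop the header line,
  # re-compress, and stream it straight into HDFS
  unzip -p /mnt/nfs/stats/subdir/stats_20160115093000.zip wanted.csv \
    | tail -n +2 \
    | gzip \
    | hdfs dfs -put - /data/stats/20160115/wanted.csv.gz

The per-file transformation itself seems straightforward; the part I'm
unsure about is how to drive it from NiFi.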

So, where I'm having a hard time visualizing this in NiFi is with the
first step. I need to scan the NFS files after 8 AM every day (when I
know all files for the previous 24 hours will be present), find that
day's set of files using the yyyymmdd part of the file names, then
extract the one file I need from each and process it into HDFS.

I could imagine a processor that runs once every 24 hours on a cron
schedule, and I could imagine running an ExecuteProcess processor
against a bash script that returns the list of files matching that
day's yyyymmdd (a rough sketch is below). Then I get stuck: how do I
take that list of 160 file paths and kick off processing of each one
in parallel through the rest of the ETL flow?
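
For what it's worth, the listing script I have in mind is roughly this
(the mount point is a made-up placeholder, and GNU date is assumed):

  # list yesterday's zip files anywhere under the NFS mount
  day=$(date -d '1 day ago' +%Y%m%d)
  find /mnt/nfs/stats -type f -name "*${day}*.zip"

That would print one path per line for ExecuteProcess to pick up as
its output.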

Thanks in advance for any ideas
