Mark, I'm far from seasoned, but I'll take a swing at it to check my understanding (or lack thereof). I'd break the task into two parts:
Identify and move the files to a staging location, then process the zip files from that staging location.

Flow1: Run a cron-driven *GenerateFlowFile* processor to start the process every 24 hours, shortly after 8 AM -> *ExecuteStreamCommand* to run your bash script that streams the 160 filenames of interest (a rough sketch of such a script is at the bottom of this mail) -> *SplitText* to generate a new flow file for each zip filename. That can be routed into a *DistributeLoad* processor, which will distribute the flow files to *ExtractText* processors to extract the filename and path from each flow file's content, then pass to *UpdateAttribute* so that the filename and path are accessible via NiFi Expression Language. Finally, pass each flow file to *ExecuteStreamCommand* (cp /${path_attribute}/${filename} /location2/${filename}); this copies the zip file to another directory (location2), keeping the files at the source for other users.

Flow2: *GetFile* from location2 -> *UnpackContent* -> *RouteOnAttribute* (to select the CSV of interest and discard the rest) -> *ExecuteStreamCommand* (sed '1d') to remove the header -> *CompressContent* -> *PutHDFS*

Hope this helps, and I hope this isn't too far off.

Thanks,
Lee

On Sat, Oct 24, 2015 at 10:25 PM, Mark Petronic <markpetro...@gmail.com>
wrote:
> Reading some other posts, I stumbled on this JIRA [1] which seems to
> directly relate to my question in this post.
>
> [1] https://issues.apache.org/jira/browse/NIFI-631
>
> On Sat, Oct 24, 2015 at 11:44 PM, Mark Petronic <markpetro...@gmail.com>
> wrote:
> > So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty
> > excited about using it. I'm running HDP and need to construct an
> > ETL-like flow and would like to try to start, as a new user to NiFi,
> > using a "best practice" approach. Wondering if some of you more
> > seasoned users might provide some thoughts on my problem?
> >
> > 1. 160 zip files/day show up on an NFS share in various
> > subdirectories, and their filenames contain the yyyymmddHHMMSS of
> > when the stats were generated.
> > 2. Each zip file contains 4 or more large CSV files.
> > 3. I need just one of those CSVs from each zip file each day, and
> > they all add up to about 10GB uncompressed.
> > 4. I need to extract that one file from each zip, strip off the
> > first line (the headers), and store it in HDFS compressed again
> > using gzip or snappy.
> > 5. I cannot delete the NFS file after the copy to HDFS because
> > others need access to it for some time.
> >
> > So, where I am having a hard time visualizing doing this in NiFi is
> > with the first step. I need to scan the NFS files after 8 AM every
> > day (when I know all files for the previous 24 hours will be
> > present), find the set of files for that day using the yyyymmdd
> > part of the filenames, then perform the extract of the one file I
> > need and process it into HDFS.
> >
> > I could imagine a processor that runs once every 24 hours on a cron
> > schedule. I could imagine running an ExecuteProcess processor
> > against a bash script to get the list of all the files that match
> > the yyyymmdd. Then I get stuck. How do I take this list of 160 file
> > paths and start the job of processing each one of them in parallel
> > to run the ETL flow?
> >
> > Thanks in advance for any ideas
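P.S. Since the listing script does most of the heavy lifting in Flow1, here is a minimal sketch of the kind of script *ExecuteStreamCommand* could run. The mount point, directory layout, and date handling are my assumptions, not details from your post; the only real requirement is that it prints one zip path per line:

#!/usr/bin/env bash
# Hypothetical listing script for Flow1's ExecuteStreamCommand step.
# NFS_ROOT and the "yesterday" logic are assumptions; adjust to the real share.
# Prints one zip path per line so SplitText can make one flow file per path.

NFS_ROOT=/mnt/nfs/stats            # assumed NFS mount point
DAY=$(date -d yesterday +%Y%m%d)   # filenames embed yyyymmddHHMMSS

# Find the previous day's zips in all subdirectories of the share.
find "$NFS_ROOT" -type f -name "*${DAY}*.zip"

With *SplitText*'s Line Split Count set to 1, each printed path becomes its own flow file, which is what lets *DistributeLoad* fan the 160 copies out in parallel.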