MergeContent/SplitText - Performance against large CSV files

Mark Petronic Fri, 13 Nov 2015 07:45:19 -0800

I have a concept question. Say I have 10 GB of CSV files/day containing
records where 99% of them are from the last couple days but there are
stragglers that are late reported that can date back many months. Devices
that are powered off at times don't report but eventually do when powered
on and report old, captured data - store and forward kind of thing. I want
to capture all the data for historic reasons. There are about 200 million
records per day to process. I want to put this data in Hive tables that are
partitioned by year, month, and day and use ORC columnar storage. These
tables are external Hive tables and point to the directories where I want
to drop these files on HDFS, manually add new partitions, as needed, and
immediately be able to query using HQL.


Nifi Concept:

0. Use GetText to get a CSV file
1. Use UpdateAttribute to parse the incoming CSV file name to obtain a
device_id and set that as an attribute on the flow file
2. Use SplitText to split each row into a flow file.
3. Use ExtractText to identify the field that is the record timestamp and
create a year, month, and day attribute on the flow file from that
timestamp.
4. Use MergeContent with a grouping key made up of
(device_id,year,month,day)
5. Convert each file to ORC (many Parquet). This stage will likely require
me building a custom processor because the conversion is not going to be a
simple A-to-B. I want to do some validation on fields, type conversion, and
other custom stuff against some source-of-truth schema stored in a web
service with REST API.
6. Use PutHDFS to store these ORC files in directories like
.../device_id/year=2015/month=11/day=9 by using the attributes already
present from the upstream processors to build up the path, where device_id
is the Hive table name and the year, month, day are the partition key
name=value per Hive format. The file names will just be some unique ID,
they don't really matter
7. Use ExecuteProcessStream to execute a Hive script that will "alter table
add partitions...." for any partitions that were newly created on this
schedule run

Is this insane or is it what Nifi was designed to do? I could definitely
see using a Spark job to do the group by (device_id,year,month,day) stage.
200M flow files from the SplitText is the one that has me wondering if I am
nuts thinking of doing that? There must be overhead on flow files and
deprecating them to one line each seems to me as a worst case scenario. But
it all depends on the core design and whether Nifi is optimized to handle
such a use case.

Thanks,
Mark

MergeContent/SplitText - Performance against large CSV files

Reply via email to