Mark,

Ok thanks for the more detailed explanation. I think that makes RouteText a
much more appealing solution. It is available on master now.
Thanks
-Mark

Sent from my iPhone

> On Nov 13, 2015, at 12:12 PM, Mark Petronic <[email protected]> wrote:
>
> Thank you, Mark, for the quick reply. My comments on your comments...
>
> "That's a great question! 200 million per day equates to about 2K - 3K per
> second."
>
> Unfortunately, the rate will be much more extreme. Those records all show
> up over the course of about 4 hours. So, every time a new zip file appears
> on my NFS share, I grab it and process it. About 160 files will appear over
> those 4 hours. For example, an average-sized zip file would likely contain
> about 180,000 or 1,600,000 records to process in that one scheduled NiFi
> run. And it is very likely that, for a given run, say scheduled to run
> every 30 minutes, there could be multiple files to process. I did not
> mention it before, but there are really two types of very large CSV files
> I have to process here: one is 18M records per day and the other 200M
> records per day. So, traffic is very bursty.
>
> Does that change anything regarding the intended use case for the new
> RouteText?
>
> "We should have a RouteCSV processor as well."
>
> That would be very nice and definitely more performant without the need
> for regex matching. However, I would definitely benefit even from RouteText
> in the interim. For my use case, the regex will be pretty simple because
> the timestamp is close to the front of the record, but I see where you are
> going on the potential complexity with groups and widely spread out fields
> of interest.
>
> Is RouteText available on any branch where I could build and play around
> with it before 0.4.0?
>
> Thanks,
> Mark
>
>> On Fri, Nov 13, 2015 at 11:14 AM, Mark Payne <[email protected]> wrote:
>> Mark,
>>
>> That's a great question! 200 million per day equates to about 2K - 3K per
>> second. So that is quite reasonable.
>> You are very correct, though, that splitting that CSV into tons of
>> one-line FlowFiles does indeed have a cost.
>> Specifically, the big cost is the Provenance data that is generated at
>> that rate. But again, 2K - 3K per second
>> going through a handful of Processors is a very reasonable workload.
>>
>> I will caution you, though, that there is a ticket [1] where people will
>> sometimes run into Out Of Memory Errors
>> if they try to split a huge CSV into individual FlowFiles, because it
>> holds all of those FlowFile objects (not the
>> data itself but the attributes) in memory until the session is committed.
>> The workaround for this (until that ticket
>> is completed) is to use a SplitText to split into 10,000 lines or so per
>> FlowFile and then another SplitText to
>> split each of those smaller ones into 1-line FlowFiles.
>>
>> Also of note, in 0.4.0, which is expected to be released in around a week
>> or so, there is a new RouteText
>> Processor. This, I think, will make your life far easier. Rather than
>> using SplitText, ExtractText, and MergeContent
>> in order to group the text, RouteText will allow you to supply a Grouping
>> Regex. That regex can just pull out the
>> device id, year, month, and day from each line and group together lines
>> of text that have the same values into a single FlowFile.
>> For instance, if your CSV looked like:
>>
>> # device_id, device_manufacturer, value, year, month, day, hour
>> 1234, Famous Manufacturer, 83, 2015, 11, 13, 12
>>
>> You could define a grouping regex as:
>>
>> (\d+), .*?, .*?, (\d+), (\d+), (\d+), .*
>>
>> It looks complex, but it's just breaking apart the CSV into individual
>> fields and grouping on device_id, year, month, day.
>> This will also create a RouteText.Group attribute with the value "1234,
>> 2015, 11, 13".
>>
>> This processor provides two benefits: it combines all of the grouping
>> into a single Processor, and it cuts down on the
>> millions of FlowFiles that are generated and then merged back together.
>>
>> As I write this, though, I am realizing that the regex above is quite a
>> pain. We should have a RouteCSV processor as well.
>> Though it won't provide any features that RouteText can't provide, it
>> will make configuration far easier. I created a ticket
>> for this here [2]. I'm not sure that it will make it into the 0.4.0
>> release, though.
>>
>> I hope this helps!
>>
>> Thanks
>> -Mark
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-1008
>> [2] https://issues.apache.org/jira/browse/NIFI-1161
>>
>>
>>> On Nov 13, 2015, at 10:44 AM, Mark Petronic <[email protected]> wrote:
>>>
>>> I have a concept question. Say I have 10 GB of CSV files per day
>>> containing records where 99% of them are from the last couple of days,
>>> but there are late-reported stragglers that can date back many months.
>>> Devices that are powered off at times don't report, but eventually do
>>> when powered on and report old, captured data - a store-and-forward kind
>>> of thing. I want to capture all the data for historic reasons. There are
>>> about 200 million records per day to process. I want to put this data in
>>> Hive tables that are partitioned by year, month, and day and use ORC
>>> columnar storage. These are external Hive tables that point to the
>>> directories where I want to drop these files on HDFS, manually add new
>>> partitions as needed, and immediately be able to query using HQL.
>>>
>>> NiFi Concept:
>>>
>>> 0. Use GetFile to get a CSV file.
>>> 1. Use UpdateAttribute to parse the incoming CSV file name to obtain a
>>> device_id and set that as an attribute on the flow file.
>>> 2. Use SplitText to split each row into a flow file.
>>> 3. Use ExtractText to identify the field that is the record timestamp
>>> and create year, month, and day attributes on the flow file from that
>>> timestamp.
>>> 4. Use MergeContent with a grouping key made up of
>>> (device_id, year, month, day).
>>> 5. Convert each file to ORC (maybe Parquet). This stage will likely
>>> require me building a custom processor because the conversion is not
>>> going to be a simple A-to-B. I want to do some validation on fields, type
>>> conversion, and other custom stuff against a source-of-truth schema
>>> stored in a web service with a REST API.
>>> 6. Use PutHDFS to store these ORC files in directories like
>>> .../device_id/year=2015/month=11/day=9 by using the attributes already
>>> present from the upstream processors to build up the path, where
>>> device_id is the Hive table name and year, month, day are the partition
>>> key name=value pairs per Hive format. The file names will just be some
>>> unique ID; they don't really matter.
>>> 7. Use ExecuteStreamCommand to execute a Hive script that will "alter
>>> table add partitions...."
>>> for any partitions that were newly created on this scheduled run.
>>>
>>> Is this insane, or is it what NiFi was designed to do? I could
>>> definitely see using a Spark job to do the group-by
>>> (device_id, year, month, day) stage. The 200M flow files from the
>>> SplitText step is the part that has me wondering if I am nuts to think
>>> of doing that. There must be overhead on flow files, and breaking them
>>> down to one line each seems to me like a worst-case scenario. But it all
>>> depends on the core design and whether NiFi is optimized to handle such
>>> a use case.
>>>
>>> Thanks,
>>> Mark
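
A quick sanity check on the rates discussed above, using only the numbers
given in the thread: spread evenly over a full day, 200 million records
works out to 200,000,000 / 86,400 s ≈ 2,300 records/second, which matches
the "2K - 3K per second" estimate. Compressed into the roughly 4-hour window
in which the files actually arrive, the same volume averages
200,000,000 / 14,400 s ≈ 14,000 records/second, which is why the bursty
arrival pattern changes the picture.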
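
The two-stage split workaround Mark Payne describes can be sketched as a
simple chain (Line Split Count is a property of the standard SplitText
processor; the surrounding flow and the exact values are illustrative only):

    GetFile
      -> SplitText #1 (Line Split Count = 10000)
           a 200M-line file becomes ~20,000 FlowFiles of 10,000 lines each
      -> SplitText #2 (Line Split Count = 1)
           each 10,000-line FlowFile becomes 10,000 one-line FlowFiles

The reason this sidesteps the Out Of Memory condition in NIFI-1008 is that
no single session ever holds more than a few tens of thousands of FlowFile
objects before committing, instead of all 200 million at once.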
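
For anyone who wants to try the grouping regex from the example before
wiring up RouteText, here is a minimal, self-contained Java sketch (plain
java.util.regex, not the NiFi API; the class name is just for illustration)
that applies it to the sample line and joins the capture groups into the
grouping value described in the thread:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GroupingRegexDemo {
        public static void main(String[] args) {
            // Grouping regex from the thread: capture device_id, year, month, day
            Pattern pattern =
                    Pattern.compile("(\\d+), .*?, .*?, (\\d+), (\\d+), (\\d+), .*");
            String line = "1234, Famous Manufacturer, 83, 2015, 11, 13, 12";

            Matcher m = pattern.matcher(line);
            if (m.matches()) {
                // Join the capture groups with ", "
                StringBuilder group = new StringBuilder();
                for (int i = 1; i <= m.groupCount(); i++) {
                    if (i > 1) {
                        group.append(", ");
                    }
                    group.append(m.group(i));
                }
                System.out.println(group); // prints: 1234, 2015, 11, 13
            }
        }
    }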
