CSV files as input

Keith Wiley Wed, 22 Feb 2012 11:01:38 -0800

It seems nearly impossible to use CSV files as Hadoop input.  I see that there 
is a CsvRecordInput class, but I have found virtually no examples online of how 
to use it, and the one example I did find blatantly assumed that CSV records 
are delimited by endlines, which the CSV spec does not guarantee.  Based on my 
analysis below, I don't see how CSV input is possible, so I don't understand 
how CsvRecordInput can work.  I am also having trouble understanding the 
completely undocumented CsvRecordInput.java; it isn't clear how that class is 
intended to be used.  If CsvRecordInput solves all my problems, then great, 
but how do I use it?


I need to process CSV files which will almost certainly contain quoted 
endlines.  I have tried to write my own record reader for this task and have 
concluded that it is virtually impossible without reading from the beginning 
of the file.  I explain below.

Consider this: assuming a split starts at some arbitrary point in the file, the 
standard record reader approach is to skip to the end of the record the split 
landed in and begin reading at the start of the next full record.  But if you 
start at an arbitrary location, there is no way to positively identify the end 
of a CSV record without potentially reading to the end of the file!

For example, we must consider the possibility that the split begins in the 
middle of a quoted string (therefore, endlines do not delimit records because 
they may be within a string).  We must therefore scan for a possible end-quote 
to close the string, but if we *didn't* begin within a string there may *be no 
end-quote at all* (the entire CSV file might not contain a single quoted 
string).  The only way to identify that we did not begin within a quoted string 
is to scan to the end of the CSV file (not the end of the *split* mind you).
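
To make the ambiguity concrete, here is a minimal Python sketch (not Hadoop 
code; the function name record_ends and the sample chunks are made up for 
illustration).  It assumes the only quoting rule is the common one where a 
double quote opens or closes a field and a literal quote is written as two 
quotes.  It scans the same chunk twice, once assuming the split began outside a 
quoted string and once assuming it began inside one, and the record boundaries 
it finds disagree.

```python
def record_ends(chunk, in_quotes):
    """Return offsets within chunk of endlines that terminate a record,
    given an ASSUMED quoting state at the start of the chunk."""
    ends = []
    for i, ch in enumerate(chunk):
        if ch == '"':
            # A doubled quote ("") toggles the state twice, so it leaves
            # the state unchanged, which is what the escape requires.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            ends.append(i)
    return ends


# Hypothetical split contents; we do not know what preceded them.
chunk = 'a,b\nc,"d\ne",f\ng,h\n'
started_outside = record_ends(chunk, in_quotes=False)  # [3, 13, 17]
started_inside = record_ends(chunk, in_quotes=True)    # [8]

# A chunk with no quotes at all: if we wrongly assume we began inside a
# quoted string, no endline in the chunk ever looks like a record end.
plain = 'a,b\nc,d\n'
plain_inside = record_ends(plain, in_quotes=True)      # []
```

Both readings of the first chunk are locally consistent, so nothing inside the 
chunk itself says which one is right.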

So, initializing a CSV record reader with absolute error-free confidence 
potentially requires reading not only the entire split at initialization time 
(grossly inefficient in itself) but the entire file, which may not even reside 
on the current node!
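
For contrast, the one approach that is always correct under the same quoting 
assumption (quotes only open or close a field, or appear doubled as an escape) 
is to scan from the beginning of the file.  A rough Python sketch follows; the 
name in_quotes_at and the sample data are hypothetical.  The cost grows with 
the split's offset, which is exactly the inefficiency described above.

```python
def in_quotes_at(data, offset):
    """Quoting state at `offset`, found by scanning from the start of the
    file.  Assumes quotes are never escaped any way other than doubling."""
    # An odd number of quote characters before the offset means a quoted
    # string is still open there.
    return data[:offset].count('"') % 2 == 1


# Hypothetical file contents with a quoted endline at offset 4.
data = 'x,"y\nz",w\np,q\n'
inside = in_quotes_at(data, 4)   # True: the endline at 4 is inside quotes
outside = in_quotes_at(data, 9)  # False: the endline at 9 ends a record
```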

I'm at a loss.  How can Hadoop take CSV files as input?  It must be possible.  
CSV is a very plain and common way to arrange textual data, which is Hadoop's 
forte.  I'm sure people are processing CSV data with Hadoop, since it seems 
like a natural fit, but I can't imagine how to enable Hadoop to read it under 
the conditions of Hadoop file splits.

Blech.  Help!

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________
