[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

Marcelo Elias Del Valle (JIRA) Wed, 20 Mar 2013 11:59:18 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608009#comment-13608009
 ]


Marcelo Elias Del Valle commented on MAPREDUCE-2208:
----------------------------------------------------

CSVTextInputFormat was my first attempt at this InputFormat, and I should 
remove it from GitHub at some point. If you take a look at the example, you 
will see I am only using CSVNLineInputFormat. Please don't use 
CSVTextInputFormat, as it probably doesn't work.

Honestly, I would have the same concern you had about using 
CSVTextInputFormat. Looking at the getSplits code 
(http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29),
I have the impression that a file can be split in the middle of a line, even 
when the text file has only single-line records. I could be wrong, but to the 
best of my knowledge this is how it works.
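
To make the concern concrete, here is a minimal sketch in plain Java. It is a 
simplified model, not Hadoop's actual getSplits code: it cuts a byte array into 
fixed-size ranges and checks whether each range starts right after a newline. 
The class and method names (SplitBoundaryDemo, splitStarts, startsMidLine) and 
the sample data are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified model of byte-based splitting; hypothetical names, not Hadoop code.
final class SplitBoundaryDemo {

    // Start offsets of fixed-size byte ranges covering [0, length).
    static List<Long> splitStarts(long length, long splitSize) {
        List<Long> starts = new ArrayList<>();
        for (long offset = 0; offset < length; offset += splitSize) {
            starts.add(offset);
        }
        return starts;
    }

    // True when the split start is neither offset 0 nor the byte after a newline.
    static boolean startsMidLine(byte[] data, long start) {
        return start > 0 && data[(int) (start - 1)] != '\n';
    }

    public static void main(String[] args) {
        byte[] data = "id,name\n1,alice\n2,bob\n".getBytes(StandardCharsets.US_ASCII);
        for (long start : splitStarts(data.length, 10)) {
            System.out.println("split at byte " + start
                    + ", mid-line: " + startsMidLine(data, start));
        }
    }
}
```

Whether a mid-line boundary actually hurts depends on the record reader. As far 
as I know, Hadoop's LineRecordReader skips the partial first line of a 
non-initial split and reads past the split end to finish its last line, so the 
open question is whether a given CSV reader does the same.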

However, if you subclass CSVTextInputFormat and override the isSplitable() 
method to return false, it could still be useful and would avoid parsing the 
same file twice, if you have thousands of small files instead of one huge 
file, as in my case. By doing that, you would ensure one split per file.
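
A rough sketch of the idea, as a toy model in plain Java rather than real Hadoop 
code. In Hadoop itself the hook is the protected method isSplitable(JobContext, 
Path) on org.apache.hadoop.mapreduce.lib.input.FileInputFormat (or 
isSplitable(FileSystem, Path) in the older mapred API). The classes below 
(ModelInputFormat, WholeFileModelInputFormat, WholeFileDemo) are hypothetical 
stand-ins that only mimic that hook, with file names and sizes invented for the 
example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the isSplitable hook; hypothetical classes, not Hadoop's API.
class ModelInputFormat {

    // Hook that subclasses override, playing the role of Hadoop's isSplitable.
    protected boolean isSplitable(String file) {
        return true;
    }

    // Returns "file:start+length" descriptors for every split.
    List<String> getSplits(Map<String, Long> fileLengths, long splitSize) {
        List<String> splits = new ArrayList<>();
        for (Map.Entry<String, Long> e : fileLengths.entrySet()) {
            long length = e.getValue();
            if (!isSplitable(e.getKey())) {
                // Non-splitable: the whole file becomes a single split.
                splits.add(e.getKey() + ":0+" + length);
                continue;
            }
            for (long start = 0; start < length; start += splitSize) {
                long size = Math.min(splitSize, length - start);
                splits.add(e.getKey() + ":" + start + "+" + size);
            }
        }
        return splits;
    }
}

// One split per file: each file is handed to a single mapper in full.
class WholeFileModelInputFormat extends ModelInputFormat {
    @Override
    protected boolean isSplitable(String file) {
        return false;
    }
}

final class WholeFileDemo {
    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a.csv", 250L);
        files.put("b.csv", 90L);
        System.out.println(new ModelInputFormat().getSplits(files, 100));
        System.out.println(new WholeFileModelInputFormat().getSplits(files, 100));
    }
}
```

With the override in place no split boundary can fall inside a file, so a 
record can never be cut in half, at the cost of losing parallelism within any 
single large file.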
                
> Flexible CSV text parser InputFormat
> ------------------------------------
>
>                 Key: MAPREDUCE-2208
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Trivial
>         Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the 
> csv-style datasets I've found. The Hadoop samples I've seen all use 
> FileInputFormat and Mapper<LongWritable,Text>. They drop the LongWritable key 
> and parse the Text value as a CSV line. But they are all custom-coded for 
> the format.
> CSVTextInputFormat takes any csv-encoded file and rearranges the fields into 
> the format required by a Mapper. You can drop fields and rearrange them. There 
> is also a random sampling option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into 
> org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
