[ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136765#comment-13136765 ]
Harsh J commented on MAPREDUCE-2208:
------------------------------------

I'd suggest reusing OpenCSV instead, if it is possible to. I do think the license is compatible, and it is well maintained.

On Thursday, October 27, 2011, Maksym Kovalenko (Commented) (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136680#comment-13136680 ]
>
> [...] uses comma as a delimiter and happen to have comma in one of the values, for example:
> [...] 7 columns for the above case instead of 3.
> In this case according to CSV escaping rules it has to be escaped by another double quote, for example:
> [...] instead of patterns, one had to provide delimiter character (comma by default) and quote character (double quote by default). Then I and other users won't have to struggle with possible regex patterns (see my questions above, I'm still curious if you can come up with one).
> [...] any regexes that you need if necessary (if you want to stick to current implementation).
> By the way, right now you have some fragility in the implementation when you prepend user provided regex with a "\\". This will break in case when user supplied pattern itself starts with "\\".
-- 
Harsh J

> Flexible CSV text parser InputFormat
> ------------------------------------
>
> Key: MAPREDUCE-2208
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Lance Norskog
> Priority: Trivial
> Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java
>
>
> CSVTextInputFormat is a configurable CSV parser tuned to most of the csv-style datasets I've found. The Hadoop samples I've seen all use FileInputFormat and Mapper<LongWritable,Text>. They drop the LongWritable key and parse the Text value as a CSV line. But they are all custom-coded for the format.
> CSVTextInputFormat takes any csv-encoded file and rearranges the fields into the format required by a Mapper. You can drop fields & rearrange them. There is also a random sampling option to make training/test runs easier.
> Attached are CSVTextInputFormat.java and a unit test for it. Both go into org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
> This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
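
For reference, the delimiter-plus-quote-character approach that Maksym Kovalenko describes in the quoted comment could look roughly like the sketch below. This is not the code from the attached CSVTextInputFormat.java and not OpenCSV's API; the class name CsvLineSplitter, the split method, and the sample rows are all hypothetical, since the original examples in the comment did not survive. It assumes a field is quoted with a single quote character and that a quote inside a quoted field is escaped by doubling it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one CSV line using a delimiter character and a
// quote character instead of a user-supplied regex.
final class CsvLineSplitter {

    static List<String> split(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote); // doubled quote means a literal quote
                        i++;                   // skip the second quote of the pair
                    } else {
                        inQuotes = false;      // closing quote of the field
                    }
                } else {
                    current.append(c);         // delimiters inside quotes are data
                }
            } else if (c == quote) {
                inQuotes = true;               // opening quote of a field
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // Sample rows are made up for illustration.
        System.out.println(String.join(" | ", split("1,\"Smith, John\",42", ',', '"')));
        // prints: 1 | Smith, John | 42
        System.out.println(String.join(" | ", split("1,\"say \"\"hi\"\"\",42", ',', '"')));
        // prints: 1 | say "hi" | 42
    }
}
```

In a Mapper<LongWritable,Text>, a helper like this would be applied to value.toString() for each input line. Records that span several physical lines are not handled by this sketch, which is one reason reusing a maintained library such as OpenCSV may be the better route.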