[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2013-03-20 Thread Christian Tzolov (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607669#comment-13607669
]

Christian Tzolov commented on MAPREDUCE-2208:
-

Hi Marcelo, the multiline CSVInputFormat inherits the getSplits()
implementation from the parent FileInputFormat. Therefore I see a potential
risk of splitting one multiline record across two (or more) different splits.
Is this a valid concern, or am I missing something?
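
To make the concern concrete, here is a minimal, Hadoop-free sketch in plain Java. It assumes a naive splitter that cuts at fixed offsets and a CSV dialect in which the quote character toggles a quoted field. The names SplitBoundaryCheck, insideQuotedField and fixedSizeSplitStarts are hypothetical and belong neither to Hadoop nor to the attached code. A split start that lands inside a quoted field means a line-oriented reader would mistake the embedded newline for a record boundary.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitBoundaryCheck {

    /**
     * Returns true if offset pos lies inside a quoted field of csv.
     * Simplification: every quote character toggles the state, so an offset
     * that falls between the two characters of a doubled (escaped) quote is
     * reported as "outside".
     */
    static boolean insideQuotedField(String csv, int pos, char quote) {
        boolean inQuotes = false;
        for (int i = 0; i < pos && i < csv.length(); i++) {
            if (csv.charAt(i) == quote) {
                inQuotes = !inQuotes;
            }
        }
        return inQuotes;
    }

    /** Start offsets a purely size-based splitter would pick (illustrative only). */
    static List<Integer> fixedSizeSplitStarts(int length, int splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Integer> starts = new ArrayList<>();
        for (int off = 0; off < length; off += splitSize) {
            starts.add(off);
        }
        return starts;
    }
}
```

For the two-record input `1,"line one\nline two",x\n2,plain,y\n` and a split size of 12, the second split would start at offset 12, which insideQuotedField reports as being inside the quoted field of the first record.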

Flexible CSV text parser InputFormat

Key: MAPREDUCE-2208
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2208
Project: Hadoop Map/Reduce
Issue Type: New Feature
Reporter: Lance Norskog
Priority: Trivial
Attachments: CSVTextInputFormat.java, TestCSVTextFormat.java

CSVTextInputFormat is a configurable CSV parser tuned to most of the
csv-style datasets I've found. The Hadoop samples I've seen all use
FileInputFormat and Mapper<LongWritable,Text>. They drop the LongWritable key
and parse the Text value as a CSV line. But they are all custom-coded for
the format.
CSVTextInputFormat takes any csv-encoded file and rearranges the fields into
the format required by a Mapper. You can drop fields and rearrange them. There
is also a random sampling option to make training/test runs easier.
Attached are CSVTextInputFormat.java and a unit test for it. Both go into
org.apache.hadoop.mapreduce.lib.input under src/java and test/mapred/src.
This is compiled against hadoop-0.0.20.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2013-03-20 Thread Marcelo Elias Del Valle (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607716#comment-13607716
]

Marcelo Elias Del Valle commented on MAPREDUCE-2208:

Christian, this is a valid concern. Actually, when I created the first version
of this input format, I chose to use the CSV line numbers as the keys.
It worked well until I tested it on a cluster (Amazon EMR with 15
instances). When I did, I realized the line numbers weren't a good key, as they
wouldn't give the right results across cluster nodes.
I fixed that by using the file position as the input key, just as NLineInputFormat
does
(http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html).
I have tested it a lot and so far I have found no problems. However, if you find
a problem I didn't see, please tell me, as I would be very interested in
fixing it.

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2013-03-20 Thread Marcelo Elias Del Valle (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607724#comment-13607724
]

Marcelo Elias Del Valle commented on MAPREDUCE-2208:

Oh, just to add to that, I realized you possibly meant something different with
your question... You are concerned about a single CSV line being split in two
across different splits, right? No, that won't happen, because I wrote a custom
reader that reads N lines at a time. The getSplits method uses the reader to
correctly get N lines and perform the splits, so getSplits will never return half
of a line. You can actually configure how many lines you want in each split.
Yes, this is also a valid concern and I took care of it. I am sorry, I
hadn't understood your question well the first time I read it.
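
The idea described above can be sketched in plain Java, without Hadoop classes. This is not the code from the CSVInputFormat repository; the class RecordAlignedSplits and its method are hypothetical, and the sketch assumes that records end at a newline outside quotes and that each quote character toggles the quoted state. It scans the data once and closes a split after every N complete records, so no split boundary falls inside a record.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordAlignedSplits {

    /**
     * Returns {start, length} pairs, each covering at most recordsPerSplit
     * complete CSV records. A newline inside a quoted field does not end a record.
     */
    static List<int[]> splitsOfNRecords(String csv, int recordsPerSplit, char quote) {
        if (recordsPerSplit <= 0) {
            throw new IllegalArgumentException("recordsPerSplit must be positive");
        }
        List<int[]> splits = new ArrayList<>();
        boolean inQuotes = false;
        int splitStart = 0;
        int recordsInSplit = 0;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                recordsInSplit++;
                if (recordsInSplit == recordsPerSplit) {
                    // Close the split right after the record terminator.
                    splits.add(new int[] {splitStart, i + 1 - splitStart});
                    splitStart = i + 1;
                    recordsInSplit = 0;
                }
            }
        }
        if (splitStart < csv.length()) {
            // Remaining records (fewer than N) form the last split.
            splits.add(new int[] {splitStart, csv.length() - splitStart});
        }
        return splits;
    }
}
```

Note that the scan has to read the whole input to find the boundaries, which is the extra pass over the data that comes with this approach.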

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2013-03-20 Thread Christian Tzolov (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13607966#comment-13607966
]

Christian Tzolov commented on MAPREDUCE-2208:
-

Ah, I've only looked at the CSVTextInputFormat, which doesn't override the
getSplits(). CSVNLineInputFormat does indeed.

So the CSVNLineInputFormat implementation reads the entire data set twice? Once
to compute the splits and a second time for the actual read in the map tasks.
While the double-pass approach is unavoidable (IMO), I wonder what the
performance (and perhaps scalability) impact is. Do you have any numbers
comparing the standard vs. multiline implementations?
Thanks, Chris

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2013-03-20 Thread Marcelo Elias Del Valle (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13608009#comment-13608009
]

Marcelo Elias Del Valle commented on MAPREDUCE-2208:

CSVTextInputFormat was my first attempt at this input format, but I should
remove it from GitHub later... If you take a look at the example, you will see
I am only using CSVNLineInputFormat. Please don't consider using this class
(CSVTextInputFormat), as it probably doesn't work.

Honestly, I would have the same concern you had about using
CSVTextInputFormat, as looking at the getSplits code
(http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29)
I have the impression the file could be split in the middle of a line, even in
a case where you have single-line text files. I could be wrong, but to the best
of my knowledge, this is how it works.

However, if you use CSVTextInputFormat and override the isSplitable() method to
return false, it could be useful and avoid two parses of the same file, if you
have 1000s of small files instead of one huge file, like in my case. By doing
that, you would ensure 1 split per file.

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2013-01-25 Thread Marcelo Elias Del Valle (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13562696#comment-13562696
]

Marcelo Elias Del Valle commented on MAPREDUCE-2208:

Created an improved version of a CSVInputFormat, able to read multiline CSVs,
in case it is of interest: https://github.com/mvallebr/CSVInputFormat

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2011-10-26 Thread Maksym Kovalenko (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136680#comment-13136680
]

Maksym Kovalenko commented on MAPREDUCE-2208:
-

So what regex would one need to specify to parse normal CSV that uses a
comma as the delimiter and happens to have commas in one of the values, for example:

value1,"value2,more,complex,with,commas",value3

Just providing , as pattern1 will no longer work, as it will produce 7
columns for the above case instead of 3.

Also consider the following use case, where a value contains a double quote. In
this case, according to CSV escaping rules, it has to be escaped by another
double quote, for example:

column1,thank you, User for the report, again, thank you,column3

Considering the above two cases, what value for pattern1 should I provide?

I think configuration of CSVTextInputFormat would be more natural if, instead of
patterns, one had to provide a delimiter character (comma by default) and a quote
character (double quote by default). Then I and other users won't have to
struggle with possible regex patterns (see my questions above, I'm still
curious if you can come up with one).

Another benefit is that from the delimiter and quote characters you can create any
regexes that you need (if you want to stick to the current
implementation). By the way, right now you have some fragility in the
implementation when you prepend the user-provided regex with a \\. This will
break when the user-supplied pattern itself starts with \\.
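
A minimal sketch of the delimiter-plus-quote configuration suggested above, in plain Java. It is not the attached CSVTextInputFormat code; the class SimpleCsvLineParser and its methods are hypothetical names. It assumes one record per input string and that a doubled quote inside a quoted field stands for a literal quote. The helper delimiterRegex shows one way to derive a safe regex from a delimiter character with java.util.regex.Pattern.quote, rather than prepending a backslash by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SimpleCsvLineParser {

    /** Splits one CSV record into fields using a delimiter and a quote character. */
    static List<String> parse(String line, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == quote) {
                    if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                        current.append(quote); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);         // delimiters are literal inside quotes
                }
            } else if (c == quote) {
                inQuotes = true;               // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Regex matching the delimiter literally, whatever character it is. */
    static String delimiterRegex(char delimiter) {
        return Pattern.quote(String.valueOf(delimiter));
    }
}
```

Calling parse with ',' and '"' on a record whose middle value is quoted and contains commas yields three fields, and a doubled quote inside a quoted value comes back as a single quote character.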

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2011-10-26 Thread Harsh J (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13136765#comment-13136765
]

Harsh J commented on MAPREDUCE-2208:

I'd suggest reusing OpenCSV instead, if possible. I do think the
license is compatible, and it is well maintained.

--
Harsh J

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2011-07-22 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13069451#comment-13069451
]

Lance Norskog commented on MAPREDUCE-2208:
--

Hadoop assumes that it will process several files of the same format. Will
every CSV file have the same header? If you split a giant CSV file into many
pieces, will you reproduce the header line in the 2nd through Nth files?

Hadoop jobs are generally configured with total knowledge of the data. The
mappers are hard-coded for the input formats.

The code could include a rule for how to decide that the first line is a header
and skip over it. That would be worth adding.
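
One possible rule, sketched in plain Java under stated assumptions: the job is told through a flag whether files carry a header, and only the split that starts at offset 0 of a file can contain it. The class HeaderSkippingReader and its parameters are hypothetical and are not part of the attached code.

```java
import java.util.ArrayList;
import java.util.List;

final class HeaderSkippingReader {

    /**
     * Returns the data records of one split. The first line is dropped only
     * when the file is declared to have a header and the split begins at the
     * start of the file; later splits of the same file are returned unchanged.
     */
    static List<String> records(List<String> linesInSplit,
                                long splitStartOffset,
                                boolean fileHasHeader) {
        boolean skipFirst = fileHasHeader && splitStartOffset == 0L;
        int from = (skipFirst && !linesInSplit.isEmpty()) ? 1 : 0;
        return new ArrayList<>(linesInSplit.subList(from, linesInSplit.size()));
    }
}
```

With this rule a single large file loses its header once, in its first split, and many smaller files with the same header each lose theirs, without repeating the header in every piece.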

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2011-07-22 Thread XiaoboGu (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13069490#comment-13069490
 ] 

XiaoboGu commented on MAPREDUCE-2208:
-

There are two scenarios:
1. A single huge CSV file with a header.
2. Many medium-sized CSV files with the same format and header.

[jira] [Commented] (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2011-07-21 Thread XiaoboGu (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13068997#comment-13068997
 ] 

XiaoboGu commented on MAPREDUCE-2208:
-

How do you handle CSV file header, or is it not supported?

[jira] Commented: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2010-12-10 Thread Lance Norskog (JIRA)

[
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12970370#action_12970370
]

Lance Norskog commented on MAPREDUCE-2208:
--

Another use case: one Wikipedia format is:
{code}
1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
{code}
which would read in as:
{code}
1: 1664968
2: 3
2: 747213
2: 1664968
etc.
{code}
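
The expansion above can be sketched in plain Java as follows. This is an illustration only, not the attached parser; the class AdjacencyLineExpander is a hypothetical name, and the sketch assumes the key is everything before the first colon and the values are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

final class AdjacencyLineExpander {

    /** Turns "key: v1 v2 v3" into one "key: v" entry per value. */
    static List<String> expand(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in line: " + line);
        }
        String key = line.substring(0, colon).trim();
        String rest = line.substring(colon + 1).trim();
        List<String> out = new ArrayList<>();
        if (rest.isEmpty()) {
            return out; // a key with no values produces no entries
        }
        for (String value : rest.split("\\s+")) {
            out.add(key + ": " + value);
        }
        return out;
    }
}
```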

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAPREDUCE-2208) Flexible CSV text parser InputFormat

2010-12-03 Thread Allen Wittenauer (JIRA)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1299#action_1299
 ] 

Allen Wittenauer commented on MAPREDUCE-2208:
-

Any chance this could get changed to CombineFile/MultiFile instead?
