[
https://issues.apache.org/jira/browse/CRUNCH-414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044111#comment-14044111
]
Brandon Inman commented on CRUNCH-414:
--------------------------------------
{quote} Not sure I understand about moving the .set/stringBuilder stuff. {quote}
Maybe I can communicate it better in code (again, treat this as pseudocode; it
isn't tested for compilation or completeness)...
{code}
public int readCSVLine(Text input) throws IOException {
  long totalBytesConsumed = 0;
  StringBuilder builder = new StringBuilder();
  do {
    totalBytesConsumed += readFileLine(inputText);
    // A line has been read. If we're still inside a quoted section, the
    // record continues on the next physical line, so keep accumulating.
    builder.append(inputText.toString());
    // TODO: the endOfFile check may not be necessary here
    if (currentlyInQuotes && !endOfFile) {
      // Add one LF to mark the line break, otherwise any multi-line CSV
      // record would be collapsed onto a single line.
      builder.append('\n');
    }
    if (totalBytesConsumed > QUOTED_SECTION_THRESHOLD_VALUE) {
      throw new IOException("Too many bytes consumed before newline: "
          + totalBytesConsumed);
    }
  } while (currentlyInQuotes && !endOfFile);
  inputText.set(builder.toString());
  input.set(inputText);
  return (int) totalBytesConsumed;
}
{code}
This may still lose a trailing LF, but I think that was already the case.
{quote} Can you think of a situation where one CSV record would be larger than
the size of the pieces the CSV file should be split into? {quote}
I'm actually curious how Hadoop/Crunch generally deals with records larger than
a split size. While it won't be a common use case, I can see the possibility of
a CSV having extremely large escaped sections that could exceed 64 MB (Base64
image data? raw sensor data? XML documents in a database?). Ultimately, if the
threshold is configurable (sketched below), the default matters less, and
keeping it around the split size seems sensible, since anyone who knows they
are processing huge records would likely adjust it anyway.
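For the configurable piece, I was picturing something along these lines. Again
untested, and the property name is entirely made up for illustration; nothing
like it exists yet:
{code}
import org.apache.hadoop.conf.Configuration;

public final class CsvRecordLimit {
  /** Hypothetical property name, used here purely for illustration. */
  public static final String MAX_RECORD_SIZE_KEY = "crunch.csv.maximum.record.size";

  /** Default threshold of 64 MB, i.e. roughly one split's worth of data. */
  public static final long DEFAULT_MAX_RECORD_SIZE = 64L * 1024 * 1024;

  private CsvRecordLimit() {
  }

  /**
   * Reads the per-record byte limit from the job configuration, falling
   * back to the 64 MB default when the property is unset.
   */
  public static long getMaxRecordSize(Configuration conf) {
    return conf.getLong(MAX_RECORD_SIZE_KEY, DEFAULT_MAX_RECORD_SIZE);
  }
}
{code}
The reader's threshold check would then compare totalBytesConsumed against
getMaxRecordSize(conf) instead of a hard-coded constant.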
{quote}As for evenly-malformed files, you're right, they won't trigger an
exception here, but will have to be dealt with either manually or by more
detailed parsing after these lines are read.{quote}
My thought is that this will probably require the same kind of anomaly
detection that would normally be applied to catch other types of bad data.
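For illustration, here's the kind of "evenly" malformed record I mean. The
stray inner quotes pair off, so the reader sees balanced quoting and returns
the line without ever hitting the threshold, even though the field boundaries
are now wrong:
{code}
name,notes
"Alice","said "yes" loudly"
{code}
A well-formed file would have escaped those inner quotes as ""yes"", so
catching this is really a downstream data-quality problem, not a line-reading
one.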
> The CSV file source needs to be a little more robust when handling multi-line
> CSV files
> ---------------------------------------------------------------------------------------
>
> Key: CRUNCH-414
> URL: https://issues.apache.org/jira/browse/CRUNCH-414
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.8.3
> Reporter: mac champion
> Assignee: mac champion
> Priority: Minor
> Labels: csv, csvparser
> Fix For: 0.8.4
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Brandon Inman recently reported an undesirable behavior in the CSV file
> source group of files. Currently, the CSVLineReader, if reading a malformed
> CSV file, can enter a state where it is perpetually waiting for an end-quote
> character. As he put it, "Malformed files are malformed files and should
> probably fail in some regard, but a hang is obviously undesirable."
> Essentially, the CSVLineReader needs to be tweaked in such a way that an
> informative exception is thrown after some threshold is reached, instead of
> basically just hanging.