[ 
https://issues.apache.org/jira/browse/PIG-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659175#comment-14659175
 ] 

Rohini Palaniswamy commented on PIG-4623:
-----------------------------------------

The patch has a couple of bugs:
   -  If you encounter EOF while looping, you return null instead of 
returning what has been processed so far in previousBufferToBeRead.
{code}
while(!doneThisLineLogically)   {
+                   if (!in.nextKeyValue()) {
+                       return null;
+                   }
{code}
   - Also, I don't see how you handle the case where the field is not quoted 
but a quote simply appears in the data. For example, 
FieldA1,fieldB1,fieldC1,fieldD1apples"oranges\nFieldA2,fieldB2,fieldC2,fieldD2 
would now output
FieldA1,fieldB1,fieldC1,fieldD1apples"oranges\nFieldA2   and drop 
fieldB2,fieldC2,fieldD2 altogether.  The previous code would correctly output 2 
records:
FieldA1,fieldB1,fieldC1,fieldD1apples"oranges
FieldA2,fieldB2,fieldC2,fieldD2

   - The patch goes ahead and reads the next line if it sees a quote that is 
not closed. It does not handle record boundaries across tasks, so there will 
be duplicate partial records if a line with quotes spans a split boundary. For 
example, for the data
FieldA1,"fieldB1 apples\noranges",fieldC1,fieldD1\nFieldA2,fieldB2,fieldC2,fieldD2 
with the split boundary ending at B1, the TextInputFormat of task1 will read 
{{FieldA1,"fieldB1 apples}}. The current patch would go ahead and read the next 
line, {{oranges",fieldC1,fieldD1}}. The TextInputFormat of task2 will also read 
{{oranges",fieldC1,fieldD1}} and output that as a record.
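
For the EOF case in the first point above, here is a minimal sketch of the intended behaviour. It assumes lines arrive through a plain Iterator<String> instead of the real record reader. LogicalLineReader, nextLogicalLine and hasOpenQuote are hypothetical names, not code from the patch, and the quote check is a naive parity count used only to keep the loop short.

```java
import java.util.Iterator;

// Hypothetical stand-in for the patched read loop; not the CSVLoader code.
final class LogicalLineReader {

    // Naive assumption: an odd number of quotes means a quoted field is still open.
    static boolean hasOpenQuote(CharSequence s) {
        int quotes = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 1;
    }

    // Returns the next logical line, or null only when no input is left at all.
    static String nextLogicalLine(Iterator<String> lines) {
        if (!lines.hasNext()) {
            return null; // true end of input, nothing buffered
        }
        StringBuilder buffered = new StringBuilder(lines.next());
        while (hasOpenQuote(buffered)) {
            if (!lines.hasNext()) {
                break; // EOF inside a quote: keep what was read so far
            }
            buffered.append('\n').append(lines.next());
        }
        return buffered.toString();
    }
}
```

The point of the sketch is the break at EOF: the buffered text is still returned, and null is reserved for the call where nothing was read.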
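
For the second point, one possible way to tell a quoted field from a quote that is just data is to treat a quote as opening a field only when it is the first character of that field. The sketch below assumes a comma delimiter and a doubled quote as the escape for a literal quote. QuoteScanner.endsInsideQuotedField is a hypothetical helper, not part of CSVLoader.

```java
// Hypothetical helper; assumes ',' as delimiter and "" as an escaped quote.
final class QuoteScanner {

    // True if the text ends while a quoted field is still open, meaning the
    // next physical line would belong to the same record.
    static boolean endsInsideQuotedField(String text) {
        boolean inQuotes = false;
        boolean atFieldStart = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        i++; // doubled quote: literal quote inside the field
                    } else {
                        inQuotes = false; // closing quote
                    }
                }
            } else if (c == ',') {
                atFieldStart = true; // the next character starts a new field
            } else {
                if (c == '"' && atFieldStart) {
                    inQuotes = true; // a quote opens a field only at its start
                }
                atFieldStart = false;
            }
        }
        return inQuotes;
    }
}
```

With this check the line FieldA1,fieldB1,fieldC1,fieldD1apples"oranges is not treated as open, so the reader would not pull in the following line.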
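
The duplication described in the third point can be reproduced with a small simulation. It assumes a simplified model of line-oriented split reading: a task that does not start at offset 0 skips the partial first line, and every task finishes the line that crosses its end offset. SplitDemo and its methods are hypothetical, use no Hadoop classes, and the quote check is an assumption, not the patch's logic.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of two tasks reading one file.
final class SplitDemo {

    // Assumed check: a quote opens a field only at the start of that field.
    static boolean endsInsideQuote(String s) {
        boolean inQuotes = false;
        boolean fieldStart = true;
        for (char c : s.toCharArray()) {
            if (inQuotes) {
                if (c == '"') inQuotes = false;
            } else if (c == ',') {
                fieldStart = true;
            } else {
                inQuotes = (c == '"' && fieldStart);
                fieldStart = false;
            }
        }
        return inQuotes;
    }

    // Index of the '\n' that ends the line containing `from`, or data.length().
    static int lineEnd(String data, int from) {
        int nl = data.indexOf('\n', from);
        return nl < 0 ? data.length() : nl;
    }

    // Records one task emits for the byte range [start, end] of data.
    static List<String> readSplit(String data, int start, int end, boolean joinOpenQuotes) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start > 0) {
            // Skip the partial first line; the previous task owns it.
            pos = Math.min(lineEnd(data, start) + 1, data.length());
        }
        while (pos <= end && pos < data.length()) {
            int stop = lineEnd(data, pos);
            StringBuilder record = new StringBuilder(data.substring(pos, stop));
            pos = Math.min(stop + 1, data.length());
            // Patched behaviour: keep reading lines while a quote is open,
            // without checking whether they lie beyond `end`.
            while (joinOpenQuotes && endsInsideQuote(record.toString()) && pos < data.length()) {
                stop = lineEnd(data, pos);
                record.append('\n').append(data.substring(pos, stop));
                pos = Math.min(stop + 1, data.length());
            }
            records.add(record.toString());
        }
        return records;
    }
}
```

Calling readSplit(data, 0, 16, true) and readSplit(data, 16, data.length(), true) on the example data, with offset 16 falling just after fieldB1, gives task1 the complete first record, while task2 still emits oranges",fieldC1,fieldD1 as a record of its own.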
    




> Fixed the 'new line' character inside double-quote causing the csv parsing 
> failure
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-4623
>                 URL: https://issues.apache.org/jira/browse/PIG-4623
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.15.0
>            Reporter: Ken Wu
>            Assignee: Ken Wu
>             Fix For: 0.16.0
>
>         Attachments: CSVLoader.java, PIG-4623-1.patch, TestCSVStorage.java
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A newline character should be allowed inside double quotes in a valid csv 
> document. For example, the following csv document should be treated as a 
> SINGLE valid csv record:
> Iphone,"{ ItemName : Cheez-It
> 21 Ounce}",
> However, the current implementation of getNext() inside the 
> org.apache.pig.piggybank.storage.CSVLoader class fails to take care of this 
> case: it sees two lines of data when in fact they should be treated as a 
> single line of data.
> This pull request fixes the above issue.
> (Note: here is a link for validating a csv document: http://csvlint.io/)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
