[ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758677#comment-16758677 ]

vishnuram selvaraj commented on SPARK-26786:
--------------------------------------------

Thanks [~hyukjin.kwon]. I have raised a GitHub 
issue (https://github.com/uniVocity/univocity-parsers/issues/308) in the univocity 
project as well. I will post here any updates I get from there.

> Handle to treat escaped newline characters('\r','\n') in spark csv
> ------------------------------------------------------------------
>
>                 Key: SPARK-26786
>                 URL: https://issues.apache.org/jira/browse/SPARK-26786
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: vishnuram selvaraj
>            Priority: Major
>
> There are some systems, like AWS Redshift, which write CSV files by escaping 
> newline characters ('\r', '\n') in addition to escaping the quote characters, 
> if they occur in the data.
> The Redshift documentation 
> ([https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html]) describes 
> the escaping requirements as follows:
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character 
> ({{\}}) is placed before every occurrence of the following characters:
>  * Linefeed: {{\n}}
>  * Carriage return: {{\r}}
>  * The delimiter character specified for the unloaded data.
>  * The escape character: {{\}}
>  * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are 
> specified in the UNLOAD command).
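>  
> For illustration, here is a minimal Python sketch of this escaping rule (my 
> own approximation of the documented behaviour, not Redshift's actual code):
> {code:python}
> # Approximation of the UNLOAD ESCAPE rule quoted above. Assumption: every
> # linefeed, carriage return, delimiter, escape character and quote gets a
> # backslash placed in front of it (quotes only when ADDQUOTES is also used).
> def redshift_escape(value, delimiter=','):
>     out = []
>     for ch in value:
>         if ch in ('\n', '\r', delimiter, '\\', '"'):
>             out.append('\\')  # place the escape character before it
>         out.append(ch)
>     return ''.join(out)
> 
> # 'this is \r\n line1' becomes 'this is \' + CR + '\' + LF + ' line1',
> # which matches the first record of the sample file shown below.
> print(redshift_escape('this is \r\n line1'))
> {code}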
>  
> *Problem statement:* 
> But the Spark CSV reader doesn't have an option to treat/remove the escape 
> characters in front of the newline characters in the data.
> It would really help if we could add a feature to handle the escaped newline 
> characters through another parameter like (escapeNewline = 'true/false'), as 
> sketched below.
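>  
> A reader call with such an option might look like this ({{escapeNewline}} 
> does not exist today; this is only the proposed shape, assuming a 
> SparkSession named {{spark}}):
> {code:python}
> # Hypothetical: escapeNewline is the proposed option, not an existing one.
> df = spark.read.csv(
>     'file:///tmp/test3.csv',
>     quote='"',
>     escape='\\',
>     multiLine=True,
>     escapeNewline=True,  # proposed: drop the escape before \r and \n
> )
> {code}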
> *Example:*
> Below are the details of my test data set up in a file.
>  * The first record in that file has an escaped Windows newline character (\r\n)
>  * The third record in that file has an escaped Unix newline character (\n)
>  * The fifth record in that file has the escaped quote character (")
> The file looks like this in the vi editor:
>  
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>  
> When I read the file with Python's csv module and an escape character 
> configured, it removes the added escape characters, as you can see below:
>  
> {code:python}
> >>> import csv
> >>> with open('/tmp/test3.csv', 'r') as readCsv:
> ...     readFile = csv.reader(readCsv, dialect='excel', escapechar='\\',
> ...                           quotechar='"', delimiter=',', doublequote=False)
> ...     for row in readFile:
> ...         print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
> But if I read the same file with the Spark CSV reader, the escape characters 
> in front of the newline characters are not removed, although the escape 
> before the quote (") is removed.
> {code:python}
> >>> redDf = spark.read.csv(path='file:///tmp/test3.csv', header='false',
> ...                        sep=',', quote='"', escape='\\', multiLine='true',
> ...                        ignoreLeadingWhiteSpace='true',
> ...                        ignoreTrailingWhiteSpace='true', mode='FAILFAST',
> ...                        inferSchema='false')
> >>> redDf.show()
> +---+------------------+
> |_c0|               _c1|
> +---+------------------+
> |  1|this is \
> line1|
> |  2|     this is line2|
> |  3|this is \
> line3|
> |  4|   this is " line4|
> |  5|     this is line5|
> +---+------------------+
> {code}
>  *Expected result:*
> {code:java}
> +---+------------------+
> |_c0|               _c1|
> +---+------------------+
> |  1|this is 
> line1|
> |  2|     this is line2|
> |  3|this is 
> line3|
> |  4|   this is " line4|
> |  5|     this is line5|
> +---+------------------+
> {code}
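>  
> Until such an option exists, a possible workaround (my own sketch; it assumes 
> the only backslashes immediately preceding CR/LF in the data are Redshift's 
> escapes) is to strip those escape characters after reading:
> {code:python}
> from pyspark.sql import functions as F
> 
> redDf = spark.read.csv('file:///tmp/test3.csv', quote='"', escape='\\',
>                        multiLine='true', mode='FAILFAST')
> # Remove a backslash that immediately precedes a carriage return or line feed.
> fixedDf = redDf.withColumn('_c1', F.regexp_replace('_c1', r'\\(\r|\n)', '$1'))
> fixedDf.show()
> {code}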


