[ https://issues.apache.org/jira/browse/SPARK-26786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758677#comment-16758677 ]
vishnuram selvaraj commented on SPARK-26786:
--------------------------------------------

Thanks [~hyukjin.kwon]. I have raised a GitHub issue (https://github.com/uniVocity/univocity-parsers/issues/308) in the univocity project as well. I will post any updates I get from there here.

> Handle to treat escaped newline characters ('\r', '\n') in spark csv
> ------------------------------------------------------------------
>
>                 Key: SPARK-26786
>                 URL: https://issues.apache.org/jira/browse/SPARK-26786
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: vishnuram selvaraj
>            Priority: Major
>
> Some systems, such as AWS Redshift, write CSV files by escaping newline characters ('\r', '\n') in addition to escaping the quote characters, if they occur in the data.
> The Redshift documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) states the escaping requirements as follows:
>
> ESCAPE
> For CHAR and VARCHAR columns in delimited unload files, an escape character ({{\}}) is placed before every occurrence of the following characters:
> * Linefeed: {{\n}}
> * Carriage return: {{\r}}
> * The delimiter character specified for the unloaded data.
> * The escape character: {{\}}
> * A quote character: {{"}} or {{'}} (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command).
>
> *Problem statement:*
> The Spark CSV reader does not have an option to treat/remove the escape characters in front of the newline characters in the data.
> It would really help if a feature were added to handle escaped newline characters through another parameter, e.g. (escapeNewline = 'true/false').
>
> *Example:*
> Below are the details of my test data set up in a file.
> * The first record in that file has an escaped Windows newline character (\r\n)
> * The third record in that file has an escaped Unix newline character (\n)
> * The fourth record in that file has an escaped quote character (")
>
> The file looks like this in the vi editor:
>
> {code:java}
> "1","this is \^M\
> line1"^M
> "2","this is line2"^M
> "3","this is \
> line3"^M
> "4","this is \" line4"^M
> "5","this is line5"^M{code}
>
> When I read the file with Python's csv module and an escape character, it removes the added escape characters, as you can see below:
>
> {code:java}
> >>> with open('/tmp/test3.csv','r') as readCsv:
> ...     readFile = csv.reader(readCsv, dialect='excel', escapechar='\\', quotechar='"', delimiter=',', doublequote=False)
> ...     for row in readFile:
> ...         print(row)
> ...
> ['1', 'this is \r\n line1']
> ['2', 'this is line2']
> ['3', 'this is \n line3']
> ['4', 'this is " line4']
> ['5', 'this is line5']
> {code}
>
> But if I read the same file with the spark-csv reader, the escape characters in front of the newline characters are not removed, while the escape before the (") is removed.
> {code:java}
> >>> redDf = spark.read.csv(path='file:///tmp/test3.csv', header='false', sep=',', quote='"', escape='\\', multiLine='true', ignoreLeadingWhiteSpace='true', ignoreTrailingWhiteSpace='true', mode='FAILFAST', inferSchema='false')
> >>> redDf.show()
> +---+------------------+
> |_c0|               _c1|
> +---+------------------+
> |  1|        this is \
> line1|
> |  2|     this is line2|
> |  3|        this is \
> line3|
> |  4|   this is " line4|
> |  5|     this is line5|
> +---+------------------+
> {code}
>
> *Expected result:*
> {code:java}
> +---+------------------+
> |_c0|               _c1|
> +---+------------------+
> |  1|          this is
> line1|
> |  2|     this is line2|
> |  3|          this is
> line3|
> |  4|   this is " line4|
> |  5|     this is line5|
> +---+------------------+
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
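
Editor's note: until an escapeNewline-style option exists, one possible workaround (a sketch only, not a feature of Spark's CSV reader) is to strip the escape character in front of newline characters before handing the file to spark.read.csv. The helper name unescape_newlines and the sample string are hypothetical, chosen to mirror the Redshift-style escaping described in the issue:

```python
import re

# Sketch of a preprocessing step (not a Spark feature): drop the backslash
# that Redshift places immediately before \r or \n, leaving other escapes
# (e.g. \" which Spark's escape='\\' option already handles) untouched.
# Caveat: this simple regex does not distinguish a doubled escape (\\)
# that happens to be followed by a real newline.
def unescape_newlines(text):
    return re.sub(r'\\(?=[\r\n])', '', text)

# A record with an escaped Windows newline, mirroring the test file above.
raw = '"1","this is \\\r\\\n line1"\r\n"2","this is line2"\r\n'
cleaned = unescape_newlines(raw)
# cleaned keeps the embedded \r\n inside the quoted field but without the
# stray backslashes, so a multiLine='true' CSV read no longer sees them as data.
```

Shown here on an in-memory string for brevity; for large files the same substitution can be applied in a streaming pass over the file before Spark reads it.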