[ https://issues.apache.org/jira/browse/SPARK-40982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-40982:
------------------------------------

    Assignee: Apache Spark

> When the value of quote or escape exists in the content of csv file, the
> character in the csv file will be misidentified
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40982
>                 URL: https://issues.apache.org/jira/browse/SPARK-40982
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: clairezhuang
>            Assignee: Apache Spark
>            Priority: Minor
>
> We found that when the quote or escape character also occurs in the content
> of a CSV file, characters in the file are misidentified.
> When the content is read by an Azure Data Factory copy activity and written
> to CSV, the content is:
> "test\\" =
> test"
> We read the CSV as follows:
> df = spark.read.csv(
>     path='test.csv',
>     sep=',',
>     header=True,
>     quote='"',
>     escape='\\',  # a single backslash; '\' alone is not a valid Python literal
>     multiLine=True,
>     lineSep='\n',
> )
> This yields *test\" =*, with *test* on the next line, but what we want is
> *test\\" = test*.
> When the content above is read by Spark:
> 1. The first \ is interpreted as escaping the second \ (so the content
>    looks like a single literal \).
> 2. The " now appears to be an unescaped quote character, so we are back in
>    the situation where Spark tries to handle this using STOP_AT_DELIMITER.
> As before, the rest of the CSV after this point is parsed incorrectly.
> We could change the quote and escape options to avoid the problem in the
> scenario above, but the CSV files are very large and may contain any
> character.
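
The two parsing steps described above (the first backslash escaping the second, then the quote closing the field) can be reproduced with a toy scanner. This is only a sketch under stated assumptions: `scan_quoted_field` is a hypothetical helper written for illustration, it is not Spark's or its CSV parser's code, and it models nothing beyond "escape consumes the next character, an unescaped quote ends the field".

```python
def scan_quoted_field(text, quote='"', escape='\\'):
    """Toy model (hypothetical, not Spark's parser): read one quoted field.

    Returns (field_value, remaining_text). The escape character makes the
    next character literal; the first unescaped quote closes the field.
    """
    if not text.startswith(quote):
        raise ValueError("field does not start with the quote character")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Step 1: the escape character consumes the character after it.
            out.append(text[i + 1])
            i += 2
        elif ch == quote:
            # Step 2: an unescaped quote terminates the field.
            return "".join(out), text[i + 1:]
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted field")


if __name__ == "__main__":
    # The content from the report: "test\\" = <newline> test"
    raw = r'"test\\" =' + "\n" + 'test"'
    value, rest = scan_quoted_field(raw)
    print(repr(value))
    print(repr(rest))
```

Under these rules the field closes right after `test\`, so ` =`, the line break, and `test"` are left outside the quoted field. That matches the report's observation that everything after this point is no longer parsed as part of the same value.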
> As for designing the content to avoid certain characters: the data sources
> affected by this issue are systems outside of our control, so we have no
> means of controlling which characters appear in the content. If we change
> the quote or escape option, the new value may conflict with the content
> again, and the content that follows is still parsed incorrectly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org