[ 
https://issues.apache.org/jira/browse/SPARK-40982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40982:
------------------------------------

    Assignee: Apache Spark

> When the value of quote or escape exists in the content of csv file, the 
> character in the csv file will be misidentified
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40982
>                 URL: https://issues.apache.org/jira/browse/SPARK-40982
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: clairezhuang
>            Assignee: Apache Spark
>            Priority: Minor
>
> We found that when the quote or escape character also appears in the content 
> of a CSV file, Spark misidentifies that character while parsing the file.
> When the content is read by an Azure Data Factory copy activity and written 
> to CSV, the file contains:
> "test\\" =
> test"
> We read the CSV as below:
> # spark is an existing SparkSession
> df = spark.read.csv(
>     path='test.csv',
>     sep=',',
>     header=True,
>     quote='"',
>     escape='\\',  # one backslash; a bare '\' is not a valid Python literal
>     multiLine=True,
>     lineSep='\n',
> )
> The result is *test\" =* followed by *test* on the next line, but what we 
> want is *test\\" = test*.
> Now when the above is read by Spark:
>  # The first \ is interpreted as escaping the second \ (so the content looks 
> like a single literal \).
>  # The " now appears to be an unescaped quote character, so we're back in the 
> situation where Spark tries to handle this using STOP_AT_DELIMITER.
> As before, the rest of the CSV after this point is parsed incorrectly.
> We could change the quote and escape options to avoid the problem in the 
> scenario above, but the CSV files are very large and may contain any 
> character. The data sources affected by this issue are systems outside of 
> our control, so we have no means of controlling what content or characters 
> will be there. If we change the quote or escape character, it may conflict 
> with the content again and cause the same issue later in the file.
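
The two parsing steps described in the report can be traced with a small, self-contained sketch. This is a toy scanner written for illustration only: it is not Spark's CSV parser, the helper name find_closing_quote is hypothetical, and it ignores delimiters, line separators, and unescaped-quote handling. It only shows where a reader that treats \ as the escape character and " as the quote character would consider the quoted field closed.

```python
def find_closing_quote(text, quote='"', escape='\\'):
    """Return the index of the first unescaped quote after the opening quote.

    Toy model only (hypothetical helper, not Spark's parser).
    """
    if not text.startswith(quote):
        raise ValueError("field must start with the quote character")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # The escape character consumes the next character, whatever it is.
            i += 2
            continue
        if ch == quote:
            return i
        i += 1
    return -1  # no unescaped closing quote found


# File content from the report: "test\\" =<newline>test"
raw = '"test\\\\" =\ntest"'

close_at = find_closing_quote(raw)
inside = raw[1:close_at]        # raw characters seen as the quoted part
leftover = raw[close_at + 1:]   # raw characters left outside the quotes

print(repr(inside))
print(repr(leftover))
```

Under these assumptions the scanner closes the field at the quote directly after the two backslashes, so the " =" and the following line fall outside the quoted value, which matches the behaviour the reporter describes.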



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
