clairezhuang created SPARK-40982:
------------------------------------

             Summary: When the value of quote or escape exists in the content 
of csv file, the character in the csv file will be misidentified
                 Key: SPARK-40982
                 URL: https://issues.apache.org/jira/browse/SPARK-40982
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: clairezhuang


We found that when the quote or escape character also appears in the content 
of a CSV file, Spark misidentifies the characters in that file.
When this content is read by an Azure Data Factory copy activity and written 
to CSV, the file contains:
"test\\" =
test"
We read the CSV as follows:
df = spark.read.csv(
    path='test.csv',
    sep=',',
    header=True,
    quote='"',
    escape='\\',  # one backslash; a bare '\' is not a valid Python string literal
    multiLine=True,
    lineSep='\n',
)
This results in *test\" =* followed by *test* on the next line, but what we 
want is *test\\" = test*.
When Spark reads the above:
 1. The first \ is interpreted as escaping the second \ (so the content looks 
    like a single literal \).
 2. The " then appears to be an unescaped quote character, so we are back in 
    the situation where Spark tries to handle this using STOP_AT_DELIMITER.
As before, the rest of the CSV after this point is parsed incorrectly.
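
The two steps above can be illustrated with a small pure-Python scanner. This 
is only a simplified sketch and not Spark's actual CSV parser: the function 
name scan_quoted is hypothetical, and it assumes that the escape character 
makes the following character literal and that the first unescaped quote ends 
the field.

```python
def scan_quoted(text, quote='"', escape='\\'):
    """Decode one quoted field (simplified; NOT Spark's real parser).

    Returns (decoded value, remainder of the input after the closing quote).
    """
    if not text.startswith(quote):
        raise ValueError("field does not start with the quote character")
    out = []
    i = 1  # skip the opening quote
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Escape character: take the next character literally.
            out.append(text[i + 1])
            i += 2
        elif ch == quote:
            # Unescaped quote: the quoted field ends here.
            return "".join(out), text[i + 1:]
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted field")


# The file content from the report: "test\\" = <newline> test"
raw = '"test\\\\" =\ntest"'
value, rest = scan_quoted(raw)
print(repr(value))  # the two backslashes have collapsed into one
print(repr(rest))   # everything after the quote that was taken as closing
```

Under these assumptions the field closes right after the collapsed backslash, 
and the remaining text (including the real closing quote) is left over for 
the parser to handle as unexpected content after a quoted value.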

We could change the quote and escape settings to avoid the problem in the 
scenario above, but the CSV files are very large and may contain any 
character. The affected data sources are systems outside of our control, so we 
have no means of controlling what content or characters will be there. 
Whatever quote and escape characters we choose may conflict with the content 
again, and the issue would persist in later content.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
