[jira] [Commented] (SPARK-38955) from_csv can corrupt surrounding lines if a lineSep is in the data

Thomas Graves (Jira) Wed, 20 Apr 2022 06:15:09 -0700


    [ https://issues.apache.org/jira/browse/SPARK-38955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524985#comment-17524985 ]


Thomas Graves commented on SPARK-38955:
---------------------------------------

The from_csv docs point to the data source options, which include lineSep, so
it seems we should update the documentation and then, as you said, not permit
it to be specified. Since this is data corruption, it seems bad, so I am
marking it as Blocker to at least get more visibility and input.
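
For illustration only, the "don't permit it" idea could look roughly like the
guard below at a call site. This is a plain Scala sketch: `FromCsvOptions` and
`rejectLineSep` are hypothetical names, not Spark APIs, and the
case-insensitive key match is an assumption based on how Spark generally
treats data source option names.

```scala
// Hypothetical helper, not part of Spark: refuse a lineSep option before the
// options map is handed to from_csv.
object FromCsvOptions {
  // Assumption: option keys are matched case-insensitively, as Spark does for
  // data source options in general.
  def rejectLineSep(options: Map[String, String]): Map[String, String] = {
    val offending = options.keys.filter(_.equalsIgnoreCase("lineSep")).toList
    // require throws IllegalArgumentException when the condition is false.
    require(
      offending.isEmpty,
      s"lineSep is not supported by from_csv (found: ${offending.mkString(", ")})")
    options
  }
}
```

A guard like this only covers the case where lineSep is passed explicitly. The
first reproduction in the quoted report passes no options at all, so it would
not be caught by it.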

> from_csv can corrupt surrounding lines if a lineSep is in the data
> ------------------------------------------------------------------
>
>                 Key: SPARK-38955
>                 URL: https://issues.apache.org/jira/browse/SPARK-38955
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> I don't know how critical this is. I was doing some general testing to
> understand {{from_csv}} and found that if the input data happens to contain a
> {{lineSep}}, the next row appears to be corrupted. {{multiLine}} does not
> appear to fix it. Because this is data corruption I am inclined to mark this
> as CRITICAL or BLOCKER, but it is an odd corner case, so I'm not going to set
> it myself.
> {code}
> Seq[String]("1,\n2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null).toDF.select(
>   col("value"),
>   from_csv(
>     col("value"),
>     StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
>     Map[String, String]())).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |   1,\n2,3,4,5|      {1, null}|
> |    6,7,8,9,10|      {null, 8}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
> {code}
> Seq[String]("1,:2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null).toDF.select(
>   col("value"),
>   from_csv(
>     col("value"),
>     StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
>     Map[String, String]("lineSep" -> ":"))).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |    1,:2,3,4,5|      {1, null}|
> |    6,7,8,9,10|      {null, 8}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
> {code}
> Seq[String]("1,\n2,3,4,5", "6,7,8,9,10", "11,12,13,14,15", null).toDF.select(
>   col("value"),
>   from_csv(
>     col("value"),
>     StructType(Seq(StructField("a", LongType), StructField("b", StringType))),
>     Map[String, String]("lineSep" -> ":"))).show()
> +--------------+---------------+
> |         value|from_csv(value)|
> +--------------+---------------+
> |   1,\n2,3,4,5|       {1, \n2}|
> |    6,7,8,9,10|         {6, 7}|
> |11,12,13,14,15|       {11, 12}|
> |          null|           null|
> +--------------+---------------+
> {code}
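
Until this is resolved, a caller could screen input values for an embedded
separator before parsing them. The sketch below is plain Scala and only
illustrates the check: `LineSepCheck` and `hasEmbeddedLineSep` are hypothetical
names, and the fallback to \r and \n when no lineSep is given is an assumption
based on the documented CSV reader defaults (\r, \r\n and \n).

```scala
// Hypothetical helper, not part of Spark: detect values that contain a line
// separator and may therefore trigger the behavior described in this issue.
object LineSepCheck {
  // Assumption: with no explicit lineSep the parser treats \r, \r\n and \n as
  // line endings, so checking for \r or \n covers all three.
  private val defaultSeps = Seq("\r", "\n")

  // Returns true when `value` contains a separator the parser would split on.
  // A null value is reported as false, since there is nothing to split.
  def hasEmbeddedLineSep(value: String, lineSep: Option[String] = None): Boolean = {
    val seps = lineSep.map(Seq(_)).getOrElse(defaultSeps)
    value != null && seps.exists(s => value.contains(s))
  }
}
```

Applied to the three reproductions above, the check flags the first row in the
first two cases and no row in the third, which matches where the corruption
shows up. In a Spark job it could be wrapped in a UDF to filter or route such
rows before they reach from_csv.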



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
