[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138365#comment-16138365 ]

Hyukjin Kwon commented on SPARK-21820:
--------------------------------------

I think the preferable format is {{format("csv")}}, which uses Spark's built-in 
CSV source. {{.format("com.databricks.spark.csv")}} refers to the third-party 
CSV library in the Databricks repository, which is not meant for Spark 2.x, 
although we had to make some changes within Spark to resolve it to Spark's 
internal source in such cases, e.g., SPARK-20590. Let's avoid reporting JIRAs 
with {{"com.databricks.spark.csv"}} in the future, to prevent confusion.

For {{multiLine}} in CSV, the newline depends on the OS, whereas the TEXT, JSON 
and CSV datasources by default handle both Windows and Linux newlines via 
Hadoop's line reader, to the best of my knowledge. I proposed a change for a 
configurable newline - https://github.com/apache/spark/pull/18581 - which I 
expect will address this problem as well.
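
Until a configurable newline is available, one workaround for this particular 
symptom is to strip the stray carriage return from the parsed column names (a 
sketch, assuming the rows themselves parsed correctly and only the header 
picked up the trailing CR):

scala> val fixed = csvFile.toDF(csvFile.columns.map(_.stripSuffix("\r")): _*)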

> csv option "multiLine" as "true" not parsing windows line feed (CR LF) 
> properly
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-21820
>                 URL: https://issues.apache.org/jira/browse/SPARK-21820
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Kumaresh C R
>              Labels: features
>         Attachments: windows_CRLF.csv
>
>
> With multiLine=true, Windows CR LF is not getting parsed properly. If I make 
> multiLine=false, it parses properly. Could you please help here?
> Attached is the CSV used in the commands below, for your reference.
> scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").load("/home/kumar/Desktop/windows_CRLF.csv")
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
> scala> val csvFile = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("parserLib", "univocity").option("multiLine", "true").load("/home/kumar/Desktop/windows_CRLF.csv")
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date: string ... 1 more field]
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered


