[ https://issues.apache.org/jira/browse/SPARK-21820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138365#comment-16138365 ]
Hyukjin Kwon commented on SPARK-21820:
--------------------------------------

I think the preferable format is {{format("csv")}} for the built-in Spark CSV. {{.format("com.databricks.spark.csv")}} basically indicates the third-party CSV library in the Databricks repository, which is not for Spark 2.x, although we had to make some changes within Spark to pick Spark's internal one in such cases, e.g., SPARK-20590. Let's avoid reporting JIRAs with {{"com.databricks.spark.csv"}} in the future to prevent confusion.

For {{multiLine}} in CSV, the newline is OS-dependent, whereas the TEXT, JSON and CSV datasources by default handle several newline variants together, e.g., the Windows and Linux ones, via Hadoop's library, to my knowledge. I proposed a change for a configurable newline - https://github.com/apache/spark/pull/18581. I guess this will address this problem as well.

> csv option "multiLine" as "true" not parsing windows line feed (CR LF)
> properly
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-21820
>                 URL: https://issues.apache.org/jira/browse/SPARK-21820
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Kumaresh C R
>              Labels: features
>        Attachments: windows_CRLF.csv
>
>
> With multiLine=true, windows CR LF is not getting parsed properly. If I make
> multiLine=false, it parses properly. Could you please help here?
> Attached the CSV used in the commands below for your reference.
>
> scala> val csvFile =
> spark.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").option("parserLib",
> "univocity").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date:
> string ...
> 1 more field]
>
> scala> csvFile.schema.fieldNames
> res0: Array[String] = Array(Sales_Dollars, Created_Date, Order_Delivered)
>
> scala> val csvFile =
> spark.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").option("parserLib",
> "univocity").option("multiLine","true").load("/home/kumar/Desktop/windows_CRLF.csv");
> csvFile: org.apache.spark.sql.DataFrame = [Sales_Dollars: int, Created_Date:
> string ... 1 more field]
>
> scala> csvFile.schema.fieldNames
> ")s1: Array[String] = Array(Sales_Dollars, Created_Date, "Order_Delivered

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
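The mangled {{res1}} line in the report is consistent with the last header field keeping a trailing carriage return: if the parser splits records on '\n' only (as a fixed-newline {{multiLine}} mode effectively does on a CR LF file), the field becomes "Order_Delivered\r", and when the shell prints it the carriage return sends the cursor back to the start of the line so the closing characters overwrite it. A minimal sketch of that splitting behavior, in plain Python rather than Spark (the byte string and values are made up for illustration; the real file is the attached windows_CRLF.csv):

```python
# Illustrative sketch, not Spark code: splitting a CR LF (Windows) file on
# '\n' alone leaves a stray '\r' on the last field of each record.
data = b"Sales_Dollars,Created_Date,Order_Delivered\r\n100,2017-08-24,true\r\n"

# Fixed '\n' record separator: the carriage return stays in the last field.
naive_header = data.split(b"\n")[0].decode().split(",")
print(repr(naive_header[-1]))  # 'Order_Delivered\r'

# Recognizing the full '\r\n' sequence (or making the newline configurable,
# as the linked PR proposes) yields the clean field name.
fixed_header = data.split(b"\r\n")[0].decode().split(",")
print(repr(fixed_header[-1]))  # 'Order_Delivered'
```

With {{multiLine=false}}, Hadoop's line reader strips both '\r\n' and '\n', which is why the non-multiline path in the report parses the header correctly.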