eugen yushin created SPARK-25506:
------------------------------------

             Summary: Spark CSV multiline with CRLF
                 Key: SPARK-25506
                 URL: https://issues.apache.org/jira/browse/SPARK-25506
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.3.1, 2.2.0
         Environment: spark 2.2.0 and 2.3.1
scala 2.11.8
            Reporter: eugen yushin
Spark produces empty rows (or `]` when printing via a call to `collect`) when dealing with a '\r' character at the end of each line in a CSV file read with `multiLine` enabled. Note that no fields are escaped in the original input file.

{code:java}
val multilineDf = sparkSession.read
  .format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "false", "escape" -> "\"", "multiLine" -> "true"))
  .load("src/test/resources/multiLineHeader.csv")

val df = sparkSession.read
  .format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "false", "escape" -> "\""))
  .load("src/test/resources/multiLineHeader.csv")

multilineDf.show()
multilineDf.collect().foreach(println)

df.show()
df.collect().foreach(println)
{code}

Result:

{code:java}
+----+-----+
|
+----+-----+
|    |
+----+-----+

]
]

+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
+----+----+

[1,1]
[2,2]
{code}

Input file:

{code:java}
cat -vt src/test/resources/multiLineHeader.csv
col1,col2^M
1,1^M
2,2^M
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
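One possible workaround, sketched below as a shell preprocessing step (this is not part of the original report; the file name is taken from the reproduction above), is to strip the carriage returns before handing the file to Spark, so the `multiLine` parser only ever sees LF line endings:

```shell
# Sketch of a workaround (assumption, not from the original issue):
# normalize CRLF -> LF before Spark reads the file.

# Recreate the report's input file with CRLF line endings.
printf 'col1,col2\r\n1,1\r\n2,2\r\n' > multiLineHeader.csv

# Delete every carriage return (dos2unix would also work where available).
tr -d '\r' < multiLineHeader.csv > multiLineHeader.unix.csv

# Verify: ^M markers are gone.
cat -vt multiLineHeader.unix.csv
```

After this step, both the `multiLine` and the default code path in the reproduction above parse the file identically.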