Maxim Gekk created SPARK-23649:
----------------------------------

             Summary: CSV schema inferring fails on some UTF-8 chars
                 Key: SPARK-23649
                 URL: https://issues.apache.org/jira/browse/SPARK-23649
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Maxim Gekk

Schema inference for CSV files fails if the file contains a character that starts with the byte *0xFF*.
{code:java}
spark.read.option("header", "true").csv("utf8xFF.csv")
{code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 63
  at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
  at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
{code}
Here is the content of the file:
{code:java}
hexdump -C ~/tmp/utf8xFF.csv
00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
00000020  2c 34 35 36 0d                                    |,456.|
00000025
{code}
Schema inference doesn't fail in multiline mode:
{code}
spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv").show
{code}
{code:java}
+-------+-----+
|channel|code
+-------+-----+
| United| 123
| ABGUN�| 456
+-------+-----+
{code}
and Spark is able to read the CSV file if the schema is specified:
{code}
import org.apache.spark.sql.types._
val schema = new StructType().add("channel", StringType).add("code", StringType)
spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
{code}
{code:java}
+-------+----+
|channel|code|
+-------+----+
| United| 123|
| ABGUN�| 456|
+-------+----+
{code}
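
For anyone who wants to reproduce this, here is a minimal sketch in plain Scala (no Spark dependency). It writes the 37 bytes from the hexdump to a temporary file. It also models why a first-byte lookup might throw on 0xFF, a byte that never occurs in well-formed UTF-8. The object name, the table `lengthsByLeadByte`, and its size of 62 entries are assumptions for illustration, not the actual Spark source. The only facts taken from the report are the file bytes and the index 63, which equals 0xFF - 0xC0. `checkedLength` shows one possible guard and is not necessarily the fix Spark should adopt.

```scala
import java.nio.file.{Files, Path}

object Utf8FirstByteSketch {
  // Bytes copied from the hexdump in the report (0x25 = 37 bytes).
  val fileBytes: Array[Byte] = Array(
    0x63, 0x68, 0x61, 0x6e, 0x6e, 0x65, 0x6c, 0x2c, 0x63, 0x6f, 0x64, 0x65, 0x0d, 0x0a, 0x55, 0x6e,
    0x69, 0x74, 0x65, 0x64, 0x2c, 0x31, 0x32, 0x33, 0x0d, 0x0a, 0x41, 0x42, 0x47, 0x55, 0x4e, 0xff,
    0x2c, 0x34, 0x35, 0x36, 0x0d
  ).map(_.toByte)

  // Hypothetical table: sequence length for lead bytes 0xC0..0xFD.
  // The size (62 entries) is an assumption, chosen so that 0xFF maps past the end.
  val lengthsByLeadByte: Array[Int] =
    Array.fill(32)(2) ++ Array.fill(16)(3) ++ Array.fill(8)(4) ++
      Array.fill(4)(5) ++ Array.fill(2)(6)

  // Unchecked lookup: for 0xFF the offset is 0xFF - 0xC0 = 63, outside the table.
  def uncheckedLength(b: Byte): Int = {
    val offset = (b & 0xFF) - 0xC0
    if (offset >= 0) lengthsByLeadByte(offset) else 1
  }

  // Guarded lookup: any byte outside the table counts as a one-byte character.
  def checkedLength(b: Byte): Int = {
    val offset = (b & 0xFF) - 0xC0
    if (offset >= 0 && offset < lengthsByLeadByte.length) lengthsByLeadByte(offset) else 1
  }

  // Writes the sample file into a fresh temporary directory and returns its path.
  def writeSampleFile(): Path = {
    val dir = Files.createTempDirectory("spark-23649")
    Files.write(dir.resolve("utf8xFF.csv"), fileBytes)
  }

  def main(args: Array[String]): Unit = {
    val path = writeSampleFile()
    println(s"Wrote ${fileBytes.length} bytes to $path")
    println(s"Guarded length for 0xFF: ${checkedLength(0xFF.toByte)}")
    try uncheckedLength(0xFF.toByte)
    catch {
      case e: ArrayIndexOutOfBoundsException =>
        println(s"Unchecked lookup failed: ${e.getMessage}")
    }
  }
}
```

The path printed by `main` can be passed to `spark.read.option("header", "true").csv(...)` in spark-shell to try the failing read against the same bytes.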