[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxim Gekk updated SPARK-23649:
-------------------------------
    Shepherd: Herman van Hovell

> CSV schema inferring fails on some UTF-8 chars
> ----------------------------------------------
>
>                 Key: SPARK-23649
>                 URL: https://issues.apache.org/jira/browse/SPARK-23649
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf8xFF.csv
>
>
> Schema inference for CSV files fails if the file contains a character that starts with the byte *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c  63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33  0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                    |,456.|
> 00000025
> {code}
> Schema inference does not fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +-------+-----+
> |channel|code
> +-------+-----+
> | United| 123
> | ABGUN�| 456
> +-------+-----+
> {code}
> and Spark is able to read the CSV file if the schema is specified explicitly:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}
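
One possible reading of the stack trace, offered as a sketch rather than as Spark's actual source: the exception index 63 equals 0xFF - 0xC0, which is what a lookup table indexed by "lead byte minus 0xC0" would produce if it stopped short of the bytes 0xFE and 0xFF. The class `LeadByteLookup`, the table `LENGTHS` and its 62-entry layout, and both method names below are hypothetical and exist only for illustration. Only the byte 0xFF and the index 63 come from the report.

```java
import java.util.Arrays;

// Illustrative sketch only: NOT a copy of Spark's UTF8String implementation.
final class LeadByteLookup {
    // Hypothetical table of sequence lengths for lead bytes 0xC0..0xFD.
    // It has 62 entries, so the valid indices are 0..61.
    static final int[] LENGTHS = buildTable();

    static int[] buildTable() {
        int[] t = new int[62];
        Arrays.fill(t, 0, 32, 2);   // lead bytes 0xC0..0xDF
        Arrays.fill(t, 32, 48, 3);  // lead bytes 0xE0..0xEF
        Arrays.fill(t, 48, 56, 4);  // lead bytes 0xF0..0xF7
        Arrays.fill(t, 56, 60, 5);  // lead bytes 0xF8..0xFB (legacy forms)
        Arrays.fill(t, 60, 62, 6);  // lead bytes 0xFC..0xFD (legacy forms)
        return t;
    }

    // Unchecked lookup: for 0xFF the offset is 63, which is past the table.
    static int numBytesUnchecked(byte b) {
        int offset = (b & 0xFF) - 0xC0;
        return offset >= 0 ? LENGTHS[offset] : 1;
    }

    // Guarded lookup: any byte outside the table counts as one byte,
    // so a stray 0xFE or 0xFF cannot index out of bounds.
    static int numBytesGuarded(byte b) {
        int offset = (b & 0xFF) - 0xC0;
        return (offset >= 0 && offset < LENGTHS.length) ? LENGTHS[offset] : 1;
    }

    public static void main(String[] args) {
        byte bad = (byte) 0xFF;
        System.out.println("offset = " + ((bad & 0xFF) - 0xC0));
        try {
            numBytesUnchecked(bad);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }
        System.out.println("guarded length = " + numBytesGuarded(bad));
    }
}
```

If the real cause has this shape, a bounds check like the one in `numBytesGuarded` would stop the exception. Whether an invalid lead byte should be counted as a single character, as assumed here, is a design decision that this sketch does not settle.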