[jira] [Commented] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495843#comment-16495843 ]

Shixiong Zhu commented on SPARK-23649:
--------------------------------------

[~cloud_fan] looks like this is fixed?

> CSV schema inferring fails on some UTF-8 chars
> ----------------------------------------------
>
>              Key: SPARK-23649
>              URL: https://issues.apache.org/jira/browse/SPARK-23649
>          Project: Spark
>       Issue Type: Bug
>       Components: SQL
> Affects Versions: 2.3.0
>         Reporter: Maxim Gekk
>         Priority: Major
>      Attachments: utf8xFF.csv
>
>
> Schema inference on CSV files fails if the file contains a character that
> starts with the byte *0xFF*.
> {code:java}
> spark.read.option("header", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 63
>   at org.apache.spark.unsafe.types.UTF8String.numBytesForFirstByte(UTF8String.java:191)
>   at org.apache.spark.unsafe.types.UTF8String.numChars(UTF8String.java:206)
> {code}
> Here is the content of the file:
> {code:java}
> hexdump -C ~/tmp/utf8xFF.csv
> 00000000  63 68 61 6e 6e 65 6c 2c 63 6f 64 65 0d 0a 55 6e  |channel,code..Un|
> 00000010  69 74 65 64 2c 31 32 33 0d 0a 41 42 47 55 4e ff  |ited,123..ABGUN.|
> 00000020  2c 34 35 36 0d                                   |,456.|
> 00000025
> {code}
> Schema inference does not fail in multiline mode:
> {code}
> spark.read.option("header", "true").option("multiline", "true").csv("utf8xFF.csv")
> {code}
> {code:java}
> +---+-+
> |channel|code
> +---+-+
> | United| 123
> | ABGUN�| 456
> +---+-+
> {code}
> Spark is also able to read the CSV file if the schema is specified explicitly:
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType().add("channel", StringType).add("code", StringType)
> spark.read.option("header", "true").schema(schema).csv("utf8xFF.csv").show
> {code}
> {code:java}
> +-------+----+
> |channel|code|
> +-------+----+
> | United| 123|
> | ABGUN�| 456|
> +-------+----+
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
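
The stack trace in the quoted description reports index 63 inside numBytesForFirstByte. One way that number could arise is sketched below as a simplified model, not Spark's actual source: it assumes a 62-entry lookup table covering lead bytes 0xC0 through 0xFD, indexed by (byte value - 192), so that 0xFF maps to offset 63 and falls outside the table. The class name, table layout, and the guarded variant are all hypothetical, and the guard is only one possible remedy, not necessarily what the pull request for this issue does.

```java
public class Utf8FirstByteSketch {
    // Hypothetical table: expected sequence length for lead bytes 0xC0..0xFD.
    // It has 62 entries, so offsets 62 (0xFE) and 63 (0xFF) are out of range.
    private static final int[] BYTES_OF_CODE_POINT = new int[62];

    static {
        int i = 0;
        for (int n = 0; n < 32; n++) BYTES_OF_CODE_POINT[i++] = 2; // 0xC0..0xDF
        for (int n = 0; n < 16; n++) BYTES_OF_CODE_POINT[i++] = 3; // 0xE0..0xEF
        for (int n = 0; n < 8; n++) BYTES_OF_CODE_POINT[i++] = 4;  // 0xF0..0xF7
        for (int n = 0; n < 4; n++) BYTES_OF_CODE_POINT[i++] = 5;  // 0xF8..0xFB
        for (int n = 0; n < 2; n++) BYTES_OF_CODE_POINT[i++] = 6;  // 0xFC..0xFD
    }

    // Unchecked lookup (assumed shape of the failing code path):
    // throws ArrayIndexOutOfBoundsException for 0xFE and 0xFF.
    static int uncheckedNumBytes(byte b) {
        int offset = (b & 0xFF) - 192;
        return offset >= 0 ? BYTES_OF_CODE_POINT[offset] : 1;
    }

    // Guarded lookup (one possible remedy): any byte outside the table
    // is counted as a single-byte character instead of failing.
    static int guardedNumBytes(byte b) {
        int offset = (b & 0xFF) - 192;
        boolean inTable = offset >= 0 && offset < BYTES_OF_CODE_POINT.length;
        return inTable ? BYTES_OF_CODE_POINT[offset] : 1;
    }

    public static void main(String[] args) {
        byte lead = (byte) 0xFF; // the byte that follows "ABGUN" in utf8xFF.csv
        try {
            uncheckedNumBytes(lead);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }
        System.out.println("guarded lookup: " + guardedNumBytes(lead));
    }
}
```

Under this model, plain ASCII bytes and well-formed lead bytes behave the same in both variants; only 0xFE and 0xFF, which never begin a valid UTF-8 sequence, distinguish them.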
[jira] [Commented] (SPARK-23649) CSV schema inferring fails on some UTF-8 chars
[ https://issues.apache.org/jira/browse/SPARK-23649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394582#comment-16394582 ]

Apache Spark commented on SPARK-23649:
--------------------------------------

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20796