[ https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853466#comment-15853466 ]
Song Jun commented on SPARK-19430:
----------------------------------

I think this is not a bug. If you want to access the Hive table, you can use it directly:
` spark.table("orc_varchar_test").show `

> Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19430
>                 URL: https://issues.apache.org/jira/browse/SPARK-19430
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2, 2.1.0
>            Reporter: Sameer Agarwal
>
> Spark throws an exception when trying to read external tables with VARCHAR columns if they're backed by ORC files that were written by Hive 1.2.1 (and possibly other versions of Hive).
> Steps to reproduce (credits to [~lian cheng]):
> # Write an ORC table using Hive 1.2.1 with
> {noformat}
> CREATE TABLE orc_varchar_test STORED AS ORC
> AS SELECT CAST('a' AS VARCHAR(10)) AS c0{noformat}
> # Get the raw path of the written ORC file
> # Create an external table pointing to this file and read the table using Spark
> {noformat}
> val path = "/tmp/orc_varchar_test"
> sql(s"create external table if not exists test (c0 varchar(10)) stored as orc location '$path'")
> spark.table("test").show(){noformat}
> The problem here is that the metadata in the ORC file written by Hive differs from the metadata written by Spark. We can inspect the ORC file written above:
> {noformat}
> $ hive --orcfiledump file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
> Structure for file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/000000_0
> File Version: 0.12 with HIVE_8732
> Rows: 1
> Compression: ZLIB
> Compression size: 262144
> Type: struct<_col0:varchar(10)>       <----
> ...
> {noformat}
> On the other hand, if you create an ORC table in Spark using the same DDL and inspect the written ORC file, you'll see:
> {noformat}
> ...
> Type: struct<c0:string>
> ...
> {noformat}
> Note that all tests are done with {{spark.sql.hive.convertMetastoreOrc}} set to {{false}}, which is the default case.
> I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of the following error:
> {code}
> java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.Text
>     at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>     at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}