There are some known bugs with the Parquet SerDe and Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to make Spark SQL use its built-in Parquet support when the SerDe looks like Parquet.
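For example, in the spark-shell, before running your queries (just a sketch along the lines of your session below, where hc is the HiveContext you already created), you should be able to set it either directly on the context or via a SET statement:

scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// or equivalently as a SQL command:
scala> hc.sql("SET spark.sql.hive.convertMetastoreParquet=true")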
On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu <terry....@smartfocus.com> wrote:
> I am currently using Spark 1.1.0 that has been compiled against Hadoop
> 2.3. Our cluster is CDH5.1.2, which runs Hive 0.12. I have two external
> Hive tables that point to Parquet (compressed with Snappy), which were
> converted over from Avro, if that matters.
>
> I am trying to perform a join with these two Hive tables, but am
> encountering an exception. In a nutshell, I launch a spark shell, create my
> HiveContext (pointing to the correct metastore on our cluster), and then
> proceed to do the following:
>
> scala> val hc = new HiveContext(sc)
>
> scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >=
> 1325376000000 and translate <= 1340063999999")
>
> scala> val segcust = hc.sql("select * from pqt_segcust_snappy where
> coll_def_id='abcd'")
>
> scala> txn.registerAsTable("segTxns")
>
> scala> segcust.registerAsTable("segCusts")
>
> scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns
> t join segCusts c on t.customerid=c.customer_id")
>
> Straightforward enough, but I get the following exception:
>
> 14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
> java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
>         at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>         at java.util.ArrayList.get(ArrayList.java:411)
>         at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
>         at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
>         at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>         at org.apache.spark.scheduler.Task.run(Task.scala:54)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> My table, pqt_segcust_snappy, has 21 columns and two partitions defined.
> Does this error look familiar to anyone? Could my usage of SparkSQL with
> Hive be incorrect, or is support for Hive/Parquet/partitioning still buggy
> at this point in Spark 1.1.0?
>
> Thanks,
>
> -Terry