Hello Terry,

I believe you hit this bug: <https://issues.apache.org/jira/browse/SPARK-3559>. The list of needed column ids was getting messed up. Can you try the master branch, or apply the code change <https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4> to your 1.1 build, and see if the problem is resolved?
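If you need to double-check that the flag Michael suggested is still in effect while you test, it can be set (or re-issued) straight from the spark-shell. A minimal sketch, assuming the same HiveContext `hc` as in the quoted snippets below:

scala> import org.apache.spark.sql.hive.HiveContext
scala> val hc = new HiveContext(sc)
scala> // switch Spark SQL to its built-in Parquet support for Parquet-backed Hive tables
scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
scala> // the same setting can also be issued as a SQL statement
scala> hc.sql("SET spark.sql.hive.convertMetastoreParquet=true")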
Thanks,
Yin

On Wed, Oct 15, 2014 at 12:08 PM, Terry Siu <terry....@smartfocus.com> wrote:

> Hi Yin,
>
> pqt_rdt_snappy has 76 columns. These two Parquet tables were created via
> Hive 0.12 from existing Avro data using a CREATE TABLE followed by an
> INSERT OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one
> partition while pqt_segcust_snappy has two partitions. For
> pqt_segcust_snappy, I noticed that when I populated it with a single
> INSERT OVERWRITE over all the partitions and then executed the Spark code,
> it would report an illegal index value of 29. However, if I manually did an
> INSERT OVERWRITE for every single partition, I would get an illegal index
> value of 21. I don't know if this will help in debugging, but here's the
> DESCRIBE output for pqt_segcust_snappy:
>
> OK
> col_name            data_type   comment
> customer_id         string      from deserializer
> age_range           string      from deserializer
> gender              string      from deserializer
> last_tx_date        bigint      from deserializer
> last_tx_date_ts     string      from deserializer
> last_tx_date_dt     string      from deserializer
> first_tx_date       bigint      from deserializer
> first_tx_date_ts    string      from deserializer
> first_tx_date_dt    string      from deserializer
> second_tx_date      bigint      from deserializer
> second_tx_date_ts   string      from deserializer
> second_tx_date_dt   string      from deserializer
> third_tx_date       bigint      from deserializer
> third_tx_date_ts    string      from deserializer
> third_tx_date_dt    string      from deserializer
> frequency           double      from deserializer
> tx_size             double      from deserializer
> recency             double      from deserializer
> rfm                 double      from deserializer
> tx_count            bigint      from deserializer
> sales               double      from deserializer
> coll_def_id         string      None
> seg_def_id          string      None
>
> # Partition Information
> # col_name          data_type   comment
>
> coll_def_id         string      None
> seg_def_id          string      None
>
> Time taken: 0.788 seconds, Fetched: 29 row(s)
>
> As you can see, I have 21 data columns, followed by the 2 partition
> columns, coll_def_id and seg_def_id. The output shows 29 rows, but that
> looks like it's just counting the rows in the console output. Let me know
> if you need more information.
>
> Thanks,
> -Terry
>
> From: Yin Huai <huaiyin....@gmail.com>
> Date: Tuesday, October 14, 2014 at 6:29 PM
> To: Terry Siu <terry....@smartfocus.com>
> Cc: Michael Armbrust <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
> Hello Terry,
>
> How many columns does pqt_rdt_snappy have?
>
> Thanks,
> Yin
>
> On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <terry....@smartfocus.com> wrote:
>
>> Hi Michael,
>>
>> That worked for me. At least I'm now further than I was. Thanks for the tip!
>>
>> -Terry
>>
>> From: Michael Armbrust <mich...@databricks.com>
>> Date: Monday, October 13, 2014 at 5:05 PM
>> To: Terry Siu <terry....@smartfocus.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>>
>> There are some known bugs with the Parquet SerDe and Spark 1.1.
>>
>> You can try setting spark.sql.hive.convertMetastoreParquet=true to
>> cause Spark SQL to use its built-in Parquet support when the SerDe looks
>> like Parquet.
>>
>> On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu <terry....@smartfocus.com> wrote:
>>
>>> I am currently using Spark 1.1.0, compiled against Hadoop 2.3. Our
>>> cluster is CDH 5.1.2, which runs Hive 0.12.
>>> I have two external Hive tables that point to Parquet (compressed
>>> with Snappy), which were converted over from Avro, if that matters.
>>>
>>> I am trying to perform a join with these two Hive tables, but am
>>> encountering an exception. In a nutshell, I launch a spark shell, create
>>> my HiveContext (pointing to the correct metastore on our cluster), and
>>> then proceed to do the following:
>>>
>>> scala> val hc = new HiveContext(sc)
>>> scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 1325376000000 and transdate <= 1340063999999")
>>> scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'")
>>> scala> txn.registerAsTable("segTxns")
>>> scala> segcust.registerAsTable("segCusts")
>>> scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")
>>>
>>> Straightforward enough, but I get the following exception:
>>>
>>> 14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
>>> java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
>>>     at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>>>     at java.util.ArrayList.get(ArrayList.java:411)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
>>>     at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>>>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
>>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
>>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>     at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>
>>> My table, pqt_segcust_snappy, has 21 columns and two partitions
>>> defined. Does this error look familiar to anyone? Could my usage of
>>> SparkSQL with Hive be incorrect, or is support for Hive/Parquet/partitioning
>>> still buggy at this point in Spark 1.1.0?
>>>
>>> Thanks,
>>> -Terry