Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hi Yin,

Sorry for the delay. I'll try the code change when I get a chance, but Michael's initial response did solve my problem. In the meantime, I'm hitting another issue with SparkSQL, which I'll probably post in a separate message if I can't figure out a workaround.

Thanks,
-Terry

On Thursday, October 16, 2014 at 7:08 AM, Yin Huai wrote:
> I guess you hit this bug <https://issues.apache.org/jira/browse/SPARK-3559>. The list of needed column ids was messed up. Can you try the master branch or apply the code change <https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4> to your 1.1 and see if the problem is resolved?
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hello Terry,

I guess you hit this bug <https://issues.apache.org/jira/browse/SPARK-3559>. The list of needed column ids was messed up. Can you try the master branch or apply the code change <https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4> to your 1.1 and see if the problem is resolved?

Thanks,
Yin

On Wed, Oct 15, 2014 at 12:08 PM, Terry Siu wrote:
> pqt_rdt_snappy has 76 columns. [...]
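A minimal way to exercise the code path being discussed here, as a sketch only, assuming the spark-shell session and table names used elsewhere in this thread and a Spark build that includes the SPARK-3559 change, might look like:

// Sketch only: assumes the spark-shell session, the table names from this
// thread, and a Spark build that contains the SPARK-3559 change.
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Leave spark.sql.hive.convertMetastoreParquet at its default (i.e. without
// Michael's workaround) so the read still goes through the Hive Parquet SerDe
// path whose column-id handling the patch touches.
val pruned = hc.sql(
  "select customer_id, age_range from pqt_segcust_snappy where coll_def_id = 'abcd'")
// This is the kind of read that was hitting IndexOutOfBoundsException; with
// the fix applied it should simply run.
println(pruned.count())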
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hi Yin,

pqt_rdt_snappy has 76 columns. These two Parquet tables were created via Hive 0.12 from existing Avro data using CREATE TABLE followed by an INSERT OVERWRITE. These are partitioned tables: pqt_rdt_snappy has one partition while pqt_segcust_snappy has two partitions. For pqt_segcust_snappy, I noticed that when I populated it with a single INSERT OVERWRITE over all the partitions and then executed the Spark code, it would report an illegal index value of 29. However, if I manually did an INSERT OVERWRITE for every single partition, I would get an illegal index value of 21. I don't know if this will help in debugging, but here's the DESCRIBE output for pqt_segcust_snappy:

OK
col_name             data_type   comment
customer_id          string      from deserializer
age_range            string      from deserializer
gender               string      from deserializer
last_tx_date         bigint      from deserializer
last_tx_date_ts      string      from deserializer
last_tx_date_dt      string      from deserializer
first_tx_date        bigint      from deserializer
first_tx_date_ts     string      from deserializer
first_tx_date_dt     string      from deserializer
second_tx_date       bigint      from deserializer
second_tx_date_ts    string      from deserializer
second_tx_date_dt    string      from deserializer
third_tx_date        bigint      from deserializer
third_tx_date_ts     string      from deserializer
third_tx_date_dt     string      from deserializer
frequency            double      from deserializer
tx_size              double      from deserializer
recency              double      from deserializer
rfm                  double      from deserializer
tx_count             bigint      from deserializer
sales                double      from deserializer
coll_def_id          string      None
seg_def_id           string      None

# Partition Information
# col_name           data_type   comment
coll_def_id          string      None
seg_def_id           string      None
Time taken: 0.788 seconds, Fetched: 29 row(s)

As you can see, I have 21 data columns, followed by the 2 partition columns, coll_def_id and seg_def_id. The output shows 29 rows, but that looks like it's just counting the rows in the console output. Let me know if you need more information.

Thanks,
-Terry

On Tuesday, October 14, 2014 at 6:29 PM, Yin Huai wrote:
> How many columns does pqt_rdt_snappy have?
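As a cross-check on those counts, a sketch along these lines (assuming the HiveContext hc from the spark-shell session in this thread) should report 23 fields, the 21 data columns plus the 2 partition columns, regardless of how many rows DESCRIBE prints:

// Sketch: compare Spark SQL's view of the table schema with the DESCRIBE output.
// Assumes the HiveContext `hc` from the spark-shell session in this thread.
val segcustSchema = hc.sql("select * from pqt_segcust_snappy limit 0").schema
println("field count = " + segcustSchema.fields.length)  // expected: 23 (21 data + 2 partition)
segcustSchema.fields.foreach(f => println(f.name + ": " + f.dataType))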
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,
Yin

On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu wrote:
> That worked for me. At least I'm now further than I was. [...]
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
Hi Michael,

That worked for me. At least I'm now further than I was. Thanks for the tip!

-Terry

On Monday, October 13, 2014 at 5:05 PM, Michael Armbrust wrote:
> There are some known bugs with the Parquet SerDe and Spark 1.1. You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use its built-in Parquet support when the SerDe looks like Parquet. [...]
Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
There are some known bugs with the Parquet SerDe and Spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark SQL to use its built-in Parquet support when the SerDe looks like Parquet.

On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu wrote:
> I am currently using Spark 1.1.0 that has been compiled against Hadoop 2.3. Our cluster is CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables that point to Parquet (compressed with Snappy), which were converted over from Avro, if that matters. [...]
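Applied to the spark-shell session from the original post, the suggested setting might look like the sketch below; it assumes the flag can simply be set on the HiveContext before the Parquet-backed tables are queried, and the table names are taken from the thread.

// Sketch of the suggested workaround, assuming the spark-shell session and
// table names from the original post.
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Ask Spark SQL to read metastore Parquet tables with its built-in Parquet
// support instead of going through the Hive Parquet SerDe.
hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
// The same thing expressed as SQL, if preferred:
// hc.sql("SET spark.sql.hive.convertMetastoreParquet=true")

val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id = 'abcd'")
println(segcust.count())  // the read that was hitting IndexOutOfBoundsException in the join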
SparkSQL IndexOutOfBoundsException when reading from Parquet
I am currently using Spark 1.1.0 that has been compiled against Hadoop 2.3. Our cluster is CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables that point to Parquet (compressed with Snappy), which were converted over from Avro, if that matters.

I am trying to perform a join with these two Hive tables, but am encountering an exception. In a nutshell, I launch a spark shell, create my HiveContext (pointing to the correct metastore on our cluster), and then proceed to do the following:

scala> val hc = new HiveContext(sc)
scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 132537600 and translate <= 134006399")
scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'")
scala> txn.registerAsTable("segTxns")
scala> segcust.registerAsTable("segCusts")
scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")

Straightforward enough, but I get the following exception:

14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
        at java.util.ArrayList.rangeCheck(ArrayList.java:635)
        at java.util.ArrayList.get(ArrayList.java:411)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

My table, pqt_segcust_snappy, has 21 columns and two partitions defined. Does this error look familiar to anyone? Could my usage of SparkSQL with Hive be incorrect, or is support for Hive/Parquet/partitioning still buggy at this point in Spark 1.1.0?

Thanks,
-Terry
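One way to narrow down where the exception comes from, as a sketch against the same session (and assuming "translate" in the first query's second predicate was meant to be "transdate"), is to read each table on its own before attempting the join:

// Diagnostic sketch: run each side of the join separately to see whether the
// IndexOutOfBoundsException is triggered by the partitioned pqt_segcust_snappy
// scan itself rather than by the join. Assumes the HiveContext `hc` above and
// that "translate" in the original query was meant to be "transdate".
val txnOnly = hc.sql(
  "select * from pqt_rdt_snappy where transdate >= 132537600 and transdate <= 134006399")
val segcustOnly = hc.sql(
  "select * from pqt_segcust_snappy where coll_def_id = 'abcd'")
println("pqt_rdt_snappy rows     = " + txnOnly.count())
println("pqt_segcust_snappy rows = " + segcustOnly.count())
// "Index: 21, Size: 21" matches the 21 columns of pqt_segcust_snappy, which
// points at that table's Parquet read path.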