Hello Terry,

I believe you hit this bug: <https://issues.apache.org/jira/browse/SPARK-3559>. The list of needed column ids was getting messed up. Can you try the master branch, or apply the code change <https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4> to your 1.1 build, and see if the problem is resolved?
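If you need to double-check that the flag Michael suggested is still in effect while you test, it can be set (or re-issued) straight from the spark-shell. A minimal sketch, assuming the same HiveContext `hc` as in the quoted snippets below:

scala> import org.apache.spark.sql.hive.HiveContext
scala> val hc = new HiveContext(sc)
scala> // switch Spark SQL to its built-in Parquet support for Parquet-backed Hive tables
scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")
scala> // the same setting can also be issued as a SQL statement
scala> hc.sql("SET spark.sql.hive.convertMetastoreParquet=true")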
Thanks,
Yin

On Wed, Oct 15, 2014 at 12:08 PM, Terry Siu <terry....@smartfocus.com> wrote:

> Hi Yin,
>
> pqt_rdt_snappy has 76 columns. These two Parquet tables were created via
> Hive 0.12 from existing Avro data using a CREATE TABLE followed by an
> INSERT OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one
> partition while pqt_segcust_snappy has two partitions. For
> pqt_segcust_snappy, I noticed that when I populated it with a single
> INSERT OVERWRITE over all the partitions and then executed the Spark code,
> it would report an illegal index value of 29. However, if I manually did an
> INSERT OVERWRITE for every single partition, I would get an illegal index
> value of 21. I don't know if this will help in debugging, but here's the
> DESCRIBE output for pqt_segcust_snappy:
>
> OK
> col_name            data_type   comment
> customer_id         string      from deserializer
> age_range           string      from deserializer
> gender              string      from deserializer
> last_tx_date        bigint      from deserializer
> last_tx_date_ts     string      from deserializer
> last_tx_date_dt     string      from deserializer
> first_tx_date       bigint      from deserializer
> first_tx_date_ts    string      from deserializer
> first_tx_date_dt    string      from deserializer
> second_tx_date      bigint      from deserializer
> second_tx_date_ts   string      from deserializer
> second_tx_date_dt   string      from deserializer
> third_tx_date       bigint      from deserializer
> third_tx_date_ts    string      from deserializer
> third_tx_date_dt    string      from deserializer
> frequency           double      from deserializer
> tx_size             double      from deserializer
> recency             double      from deserializer
> rfm                 double      from deserializer
> tx_count            bigint      from deserializer
> sales               double      from deserializer
> coll_def_id         string      None
> seg_def_id          string      None
>
> # Partition Information
> # col_name          data_type   comment
>
> coll_def_id         string      None
> seg_def_id          string      None
>
> Time taken: 0.788 seconds, Fetched: 29 row(s)
>
> As you can see, I have 21 data columns, followed by the 2 partition
> columns, coll_def_id and seg_def_id. The output shows 29 rows, but that
> looks like it's just counting the rows in the console output. Let me know
> if you need more information.
>
> Thanks,
> -Terry
>
> From: Yin Huai <huaiyin....@gmail.com>
> Date: Tuesday, October 14, 2014 at 6:29 PM
> To: Terry Siu <terry....@smartfocus.com>
> Cc: Michael Armbrust <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>
> Hello Terry,
>
> How many columns does pqt_rdt_snappy have?
>
> Thanks,
> Yin
>
> On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu <terry....@smartfocus.com> wrote:
>
>> Hi Michael,
>>
>> That worked for me. At least I'm now further than I was. Thanks for the tip!
>>
>> -Terry
>>
>> From: Michael Armbrust <mich...@databricks.com>
>> Date: Monday, October 13, 2014 at 5:05 PM
>> To: Terry Siu <terry....@smartfocus.com>
>> Cc: "user@spark.apache.org" <user@spark.apache.org>
>> Subject: Re: SparkSQL IndexOutOfBoundsException when reading from Parquet
>>
>> There are some known bugs with the Parquet SerDe and Spark 1.1.
>>
>> You can try setting spark.sql.hive.convertMetastoreParquet=true to
>> cause Spark SQL to use its built-in Parquet support when the SerDe looks
>> like Parquet.
>>
>> On Mon, Oct 13, 2014 at 2:57 PM, Terry Siu <terry....@smartfocus.com> wrote:
>>
>>> I am currently using Spark 1.1.0, compiled against Hadoop 2.3. Our
>>> cluster is CDH 5.1.2, which runs Hive 0.12.
>>> I have two external Hive tables that point to Parquet (compressed
>>> with Snappy), which were converted over from Avro, if that matters.
>>>
>>> I am trying to perform a join with these two Hive tables, but am
>>> encountering an exception. In a nutshell, I launch a spark shell, create
>>> my HiveContext (pointing to the correct metastore on our cluster), and
>>> then proceed to do the following:
>>>
>>> scala> val hc = new HiveContext(sc)
>>> scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 1325376000000 and transdate <= 1340063999999")
>>> scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'")
>>> scala> txn.registerAsTable("segTxns")
>>> scala> segcust.registerAsTable("segCusts")
>>> scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")
>>>
>>> Straightforward enough, but I get the following exception:
>>>
>>> 14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
>>> java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
>>>     at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>>>     at java.util.ArrayList.get(ArrayList.java:411)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
>>>     at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
>>>     at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
>>>     at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
>>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
>>>     at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>     at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>
>>> My table, pqt_segcust_snappy, has 21 columns and two partitions
>>> defined. Does this error look familiar to anyone? Could my usage of
>>> SparkSQL with Hive be incorrect, or is support for Hive/Parquet/partitioning
>>> still buggy at this point in Spark 1.1.0?
>>>
>>> Thanks,
>>> -Terry