Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-20 Thread Terry Siu
Hi Yin,

Sorry for the delay. I’ll try the code change when I get a chance, but Michael’s
initial response did solve my problem. In the meantime, I’m hitting another issue
with SparkSQL, which I will probably post in a separate message if I can’t figure
out a workaround.

Thanks,
-Terry

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
Hello Terry,

I guess you hit this bug: <https://issues.apache.org/jira/browse/SPARK-3559>.
The list of needed column ids was messed up. Can you try the master branch, or
apply the code change
<https://github.com/apache/spark/commit/e10d71e7e58bf2ec0f1942cb2f0602396ab866b4>
to your 1.1, and see if the problem is resolved?
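
Once you have a patched build (either master or your 1.1 with that commit
applied), a quick sanity check is to re-run the join that previously failed. A
minimal sketch, reusing the tables and filters from your earlier messages:

scala> import org.apache.spark.sql.hive.HiveContext

scala> val hc = new HiveContext(sc)

scala> hc.sql("select * from pqt_rdt_snappy where transdate >= 132537600 and transdate <= 134006399").registerAsTable("segTxns")

scala> hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'").registerAsTable("segCusts")

scala> // this is the statement that previously failed with Index: 21, Size: 21
scala> hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id").count()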

Thanks,

Yin

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-15 Thread Terry Siu
Hi Yin,

pqt_rdt_snappy has 76 columns. These two Parquet tables were created via Hive
0.12 from existing Avro data using CREATE TABLE followed by an INSERT OVERWRITE.
Both are partitioned tables: pqt_rdt_snappy has one partition while
pqt_segcust_snappy has two. For pqt_segcust_snappy, I noticed that when I
populated it with a single INSERT OVERWRITE over all the partitions and then
executed the Spark code, it would report an illegal index value of 29. However,
if I manually did an INSERT OVERWRITE for every single partition, I would get an
illegal index value of 21. I don’t know if this will help in debugging, but
here’s the DESCRIBE output for pqt_segcust_snappy:


OK
col_name            data_type   comment
customer_id         string      from deserializer
age_range           string      from deserializer
gender              string      from deserializer
last_tx_date        bigint      from deserializer
last_tx_date_ts     string      from deserializer
last_tx_date_dt     string      from deserializer
first_tx_date       bigint      from deserializer
first_tx_date_ts    string      from deserializer
first_tx_date_dt    string      from deserializer
second_tx_date      bigint      from deserializer
second_tx_date_ts   string      from deserializer
second_tx_date_dt   string      from deserializer
third_tx_date       bigint      from deserializer
third_tx_date_ts    string      from deserializer
third_tx_date_dt    string      from deserializer
frequency           double      from deserializer
tx_size             double      from deserializer
recency             double      from deserializer
rfm                 double      from deserializer
tx_count            bigint      from deserializer
sales               double      from deserializer
coll_def_id         string      None
seg_def_id          string      None

# Partition Information
# col_name          data_type   comment
coll_def_id         string      None
seg_def_id          string      None

Time taken: 0.788 seconds, Fetched: 29 row(s)


As you can see, I have 21 data columns, followed by the 2 partition columns,
coll_def_id and seg_def_id. The output reports 29 rows fetched, but that looks
like it is just counting the lines of console output. Let me know if you need
more information.
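
As a side check (just a sketch; the warehouse path and partition values below
are placeholders I made up for illustration), reading one partition directory
directly with Spark SQL should show only the 21 data columns, since the
partition columns live in the directory layout rather than inside the Parquet
files, which would line up with the illegal index of 21:

scala> // placeholder path; substitute the actual warehouse location and partition values
scala> val onePart = hc.parquetFile("/user/hive/warehouse/pqt_segcust_snappy/coll_def_id=abcd/seg_def_id=1")

scala> onePart.printSchema()  // expect 21 fields (indices 0-20), so a request for index 21 is out of range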


Thanks

-Terry


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
Hello Terry,

How many columns does pqt_rdt_snappy have?

Thanks,

Yin


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Terry Siu
Hi Michael,

That worked for me. At least I’m now further than I was. Thanks for the tip!

-Terry


Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Michael Armbrust
There are some known bugs with the Parquet SerDe and Spark 1.1.

You can try setting spark.sql.hive.convertMetastoreParquet=true to cause Spark
SQL to use its built-in Parquet support when the SerDe looks like Parquet.
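
For example, in the spark-shell (a minimal sketch; the HiveContext setup and the
table name mirror the snippets later in this thread):

scala> import org.apache.spark.sql.hive.HiveContext

scala> val hc = new HiveContext(sc)  // sc is the SparkContext the shell provides

scala> // ask Spark SQL to use its built-in Parquet reader instead of the Hive SerDe
scala> hc.setConf("spark.sql.hive.convertMetastoreParquet", "true")

scala> // re-run one of the queries from the original report
scala> hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'").take(5).foreach(println)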


SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-13 Thread Terry Siu
I am currently using Spark 1.1.0 compiled against Hadoop 2.3. Our cluster is
CDH 5.1.2, which runs Hive 0.12. I have two external Hive tables that point to
Parquet data (compressed with Snappy), which were converted over from Avro, if
that matters.

I am trying to perform a join with these two Hive tables, but am encountering 
an exception. In a nutshell, I launch a spark shell, create my HiveContext 
(pointing to the correct metastore on our cluster), and then proceed to do the 
following:

scala> val hc = new HiveContext(sc)

scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 132537600 and transdate <= 134006399")

scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'")

scala> txn.registerAsTable("segTxns")

scala> segcust.registerAsTable("segCusts")

scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")

Straightforward enough, but I get the following exception:


14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
        at java.util.ArrayList.rangeCheck(ArrayList.java:635)
        at java.util.ArrayList.get(ArrayList.java:411)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


My table, pqt_segcust_snappy, has 21 columns and two partitions defined. Does
this error look familiar to anyone? Could my usage of SparkSQL with Hive be
incorrect, or is Hive/Parquet/partitioning support still buggy at this point in
Spark 1.1.0?


Thanks,

-Terry