Parquet file generated by Spark, but not readable by Hive

2017-06-12 Thread Yong Zhang
We are using Spark 1.6.2 as ETL to generate parquet files for one dataset, 
partitioned by "brand" (a string column representing the brand in this dataset).


After the partition folders (like "brand=a") are generated in HDFS, we add the 
partitions in Hive.
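
For reference, the flow is roughly like the following sketch (paths, table and 
column names here are only placeholders, not our real schema):

// Spark 1.6.2 ETL side
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext()              // configured by spark-submit in practice
val hiveContext = new HiveContext(sc)

val df = hiveContext.read.parquet("/staging/dataset")   // placeholder source
df.write
  .mode("append")
  .partitionBy("brand")
  .parquet("/warehouse/db/tablename")

// then register the new folder as a Hive partition, e.g. brand=a
hiveContext.sql(
  "ALTER TABLE tablename ADD IF NOT EXISTS PARTITION (brand='a') " +
  "LOCATION '/warehouse/db/tablename/brand=a'")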


The Hive version is 1.2.1 (in fact, we are using HDP 2.5.0).


The problem is that for 2 of the brand partitions we cannot query the data 
generated by Spark, while the rest of the partitions work fine.


Below is the error from the Hive CLI and hive.log when I query one of the bad 
partitions, e.g. "select * from tablename where brand='BrandA' limit 3;"


Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable

Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
    at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
    at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
    at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
    at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
    at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
    at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
    ... 22 more

There is not much I can find by googling this error message, but it suggests 
that the schema in Hive differs from the schema in the parquet files.
This is a very strange case, though, because the same schema works fine for the 
other brands; "brand" is just the partition column, and all partitions share the 
same Hive table schema.
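
For what it is worth, one way to compare the two schemas directly (paths and 
table name below stand in for ours) is to print what Spark actually wrote for a 
bad partition next to what Hive declares:

// schema of the parquet files under the bad partition folder
hiveContext.read.parquet("/warehouse/db/tablename/brand=BrandA").printSchema()

// schema Hive expects for the table
hiveContext.sql("DESCRIBE tablename").show(100, false)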

If I query "select * from tablename where brand='BrandB' limit 3;" instead, 
everything works fine.

So is this really caused by a mismatch between the Hive schema and the parquet 
files generated by Spark, by the data under the different partition keys, or by 
a compatibility issue between Spark and Hive?

Thanks

Yong




Re: Parquet file generated by Spark, but not readable by Hive

2017-06-12 Thread Angel Francisco Orta
Hello,

Do you use df.write, or do you do it with hivecontext.sql("insert into ...")?
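
That is, roughly one of these two (table and column names below are only examples):

// option 1: DataFrame writer API, writing parquet files directly to HDFS
df.write.mode("append").partitionBy("brand").parquet("/warehouse/db/tablename")

// option 2: going through the Hive metastore with SQL
hiveContext.sql(
  "INSERT INTO TABLE tablename PARTITION (brand='a') " +
  "SELECT col1, col2 FROM staging_table WHERE brand='a'")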

Angel.

On 12 Jun 2017, 11:07 p.m., "Yong Zhang"  wrote:


Re: Parquet file generated by Spark, but not readable by Hive

2017-06-12 Thread ayan guha
Try setting the following parameter:

conf.set("spark.sql.hive.convertMetastoreParquet","false")

On Tue, Jun 13, 2017 at 3:34 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote:



-- 
Best Regards,
Ayan Guha


Re: Parquet file generated by Spark, but not readable by Hive

2017-06-13 Thread Yong Zhang
The issue is caused by the data, and it is indeed a type mismatch between the 
Hive schema and Spark. It is fixed now.


Without that kind of data, the problem simply isn't triggered for some brands.
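
For anyone hitting the same error, a sketch of the kind of fix that addresses it 
(assuming, as the stack trace suggests, a column that Hive declares as string but 
that Spark infers as long for some brands; names below are only examples):

import org.apache.spark.sql.functions.col

// Hive declares "code" as STRING, but for some brands every value happens to be
// numeric, so Spark infers LONG and writes an int64 parquet column for them.
// Casting before the write keeps the parquet type consistent with the Hive schema.
val fixed = df.withColumn("code", col("code").cast("string"))
fixed.write.mode("append").partitionBy("brand").parquet("/warehouse/db/tablename")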


Thanks for taking a look at this problem.


Yong


