Try setting the following param:

  conf.set("spark.sql.hive.convertMetastoreParquet", "false")
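For reference, one way to apply that setting in Spark 1.6 is on the HiveContext itself or at submit time. A minimal sketch (the app name is just a placeholder):

  // Minimal sketch (Spark 1.6, Scala).
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("parquet-etl"))
  val hiveContext = new HiveContext(sc)

  // Disable Spark's built-in conversion of Hive metastore parquet tables,
  // so the Hive SerDe handles the files instead:
  hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

  // ...or equivalently at submit time:
  //   spark-submit --conf spark.sql.hive.convertMetastoreParquet=false ...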
On Tue, Jun 13, 2017 at 3:34 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote:

> Hello,
>
> Do you use df.write, or do you insert with hiveContext.sql("insert into ...")?
>
> Angel.
>
> On Jun 12, 2017 11:07 PM, "Yong Zhang" <java8...@hotmail.com> wrote:
>
>> We are using Spark *1.6.2* as ETL to generate parquet files for one
>> dataset, partitioned by "brand" (a string representing the brand in this
>> dataset).
>>
>> After the partition folders such as "brand=a" are generated in HDFS, we
>> add the partitions in Hive.
>>
>> The Hive version is *1.2.1* (in fact, we are using HDP 2.5.0).
>>
>> Now the problem is that for 2 brand partitions we cannot query the data
>> generated by Spark, while the rest of the partitions work fine.
>>
>> Below is the error in the Hive CLI and hive.log when I query a bad
>> partition, e.g. "select * from tablename where brand='*BrandA*' limit 3;"
>>
>> Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException:
>> java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
>>
>> Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
>>   at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
>>   at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
>>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
>>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
>>   at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
>>   at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
>>   at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
>>   at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
>>   at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
>>   at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
>>   ... 22 more
>>
>> There is not much I can find by googling this error message, but it
>> suggests that the schema in Hive differs from the schema in the parquet
>> file. That is a very strange case, though: the same schema works fine for
>> the other brands, which share the whole Hive schema above, and "brand" is
>> defined as a partition column.
>>
>> If I query "select * from tablename where brand='*BrandB*' limit 3;",
>> everything works fine.
>>
>> So is this really caused by the Hive schema mismatching the parquet files
>> generated by Spark, by the data within the different partition keys, or by
>> a compatibility issue between Spark and Hive?
>>
>> Thanks
>>
>> Yong
>>

--
Best Regards,
Ayan Guha
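To pin down whether it really is a schema mismatch, one quick check is to compare the schema the Hive metastore declares against the schema Spark actually wrote into the suspect partition's files. A diagnostic sketch against Spark 1.6; the warehouse path below is an assumption, so substitute the table's real HDFS location:

  // Diagnostic sketch (Spark 1.6): compare Hive's declared schema with the
  // schema embedded in the bad partition's parquet files.
  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)  // sc is predefined in spark-shell

  // Schema as the Hive metastore sees the table:
  hiveContext.sql("DESCRIBE tablename").show(100, false)

  // Schema as Spark wrote it for the bad partition (path is hypothetical):
  hiveContext.read
    .parquet("/apps/hive/warehouse/tablename/brand=BrandA")
    .printSchema()

  // A column that prints as LongType here but is declared string in Hive
  // would explain "Cannot inspect org.apache.hadoop.io.LongWritable".

If one of the bad partition's columns shows up as a long in the parquet footer but as a string in the Hive DDL, rewriting that partition with the column cast to the declared type (or correcting the DDL) should make the Hive CLI query work again.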