We are using Spark 1.6.2 as an ETL tool to generate Parquet files for one dataset, partitioned by "brand" (a string column representing the brand in this dataset). After the partition folders (e.g. "brand=a") are generated in HDFS, we add the partitions in Hive.
The Hive version is 1.2.1 (in fact, we are using HDP 2.5.0).
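The write side of the job looks roughly like the sketch below (Spark 1.6 API; the app name, source location, and warehouse path are illustrative, not our actual values):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch of the ETL write path (Spark 1.6); paths are illustrative
val sc = new SparkContext(new SparkConf().setAppName("brand-etl"))
val sqlContext = new SQLContext(sc)

// Illustrative source; the real job reads our upstream dataset
val df = sqlContext.read.parquet("/staging/dataset")

df.write
  .partitionBy("brand")                                // creates brand=<value> folders in HDFS
  .parquet("/apps/hive/warehouse/mydb.db/tablename")   // illustrative warehouse path

The partitions are then registered in Hive with "ALTER TABLE tablename ADD PARTITION (brand='...') LOCATION '...'" statements.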
Now the problem is that for 2 of the brand partitions we cannot query the data generated by Spark, while the rest of the partitions work fine.
Below is the error I get in the Hive CLI and in hive.log if I query one of the bad partitions, e.g. "select * from tablename where brand='BrandA' limit 3;":
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
    at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
    at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
    at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
    at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
    at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
    at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
    at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
    at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
    ... 22 more
There is not much I can find by googling this error message, but it points to the schema in Hive being different from the schema in the Parquet files; the ParquetStringInspector in the stack trace seems to suggest Hive expects a string column where the Parquet data actually holds a long.
But this is a very strange case, as the same schema works fine for the other brands, which sit under the same partition column and share exactly the same Hive schema as above.
If I query, for example, "select * from tablename where brand='BrandB' limit 3;", everything works fine.
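I assume one way to narrow this down is to compare the Parquet footer schema of a good partition with that of a bad one directly from spark-shell (paths are illustrative), roughly like:

// Print the schema recorded in the Parquet files of each partition (Spark 1.6 API)
sqlContext.read.parquet("/apps/hive/warehouse/mydb.db/tablename/brand=BrandB").printSchema()
sqlContext.read.parquet("/apps/hive/warehouse/mydb.db/tablename/brand=BrandA").printSchema()

and then compare both against what "describe tablename" reports in Hive; if one of the string columns shows up as a long only in the bad partitions, that would match the LongWritable error above.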
So is this really caused by a mismatch between the Hive schema and the Parquet files generated by Spark, by the data within the different partition keys, or by a compatibility issue between Spark and Hive?
Thanks
Yong