The issue is caused by the data, and is indeed a type mismatch between the Hive schema and the Parquet schema written by Spark. It is fixed now. Without that kind of data, the problem won't be triggered for some brands. Thanks for taking a look at this problem.

Yong
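For anyone who hits the same "Cannot inspect ... LongWritable" error, here is a minimal sketch of how such a mismatch can be confirmed from spark-shell. The warehouse path and the column name ("id") below are hypothetical, for illustration only:

    // sqlContext is the HiveContext that spark-shell provides in Spark 1.6.
    // Print the schema Spark actually wrote into one partition's Parquet files.
    sqlContext.read.parquet("/apps/hive/warehouse/tablename/brand=BrandA").printSchema()
    // suppose it prints:  |-- id: long (nullable = true)

    // Compare with the column types Hive holds in the metastore.
    sqlContext.sql("DESCRIBE tablename").show(100, false)
    // suppose it shows:   id  string

    // If they disagree, align the Hive definition with the files, e.g.
    // (this DDL can also be run from the Hive CLI):
    sqlContext.sql("ALTER TABLE tablename CHANGE id id bigint")

The stack trace fits this picture: ParquetStringInspector is handed a LongWritable, i.e. Hive expects a string where the file stores a 64-bit integer, so only the partitions written with that type fail.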
________________________________
From: ayan guha <guha.a...@gmail.com>
Sent: Tuesday, June 13, 2017 1:54 AM
To: Angel Francisco Orta
Cc: Yong Zhang; user@spark.apache.org
Subject: Re: Parquet file generated by Spark, but not compatible read by Hive

Try setting the following param:

    conf.set("spark.sql.hive.convertMetastoreParquet","false")

On Tue, Jun 13, 2017 at 3:34 PM, Angel Francisco Orta <angel.francisco.o...@gmail.com> wrote:

Hello,

Do you use df.write, or do you do it with hivecontext.sql("insert into ...")?

Angel.

On Jun 12, 2017, 11:07 PM, "Yong Zhang" <java8...@hotmail.com> wrote:

We are using Spark 1.6.2 as ETL to generate Parquet files for one dataset, partitioned by "brand" (a string representing the brand in this dataset). After the partition folders (like "brand=a") are generated in HDFS, we add the partitions in Hive. The Hive version is 1.2.1 (in fact, we are using HDP 2.5.0).

Now the problem is that for 2 brand partitions we cannot query the data generated by Spark, but it works fine for the rest of the partitions. Below is the error from the Hive CLI and hive.log when I query one of the bad partitions, like "select * from tablename where brand='BrandA' limit 3;":

Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable

Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable
        at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52)
        at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:222)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:307)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:262)
        at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:72)
        at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:246)
        at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50)
        at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71)
        at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40)
        at org.apache.hadoop.hive.ql.exec.ListSinkOperator.process(ListSinkOperator.java:90)
        ... 22 more

There is not much I can find by googling this error message, but it points to the schema in Hive being different from the one in the Parquet file. It is a very strange case, though, because the same schema works fine for the other brands: "brand" is defined as a partition column, and all partitions share the whole Hive schema above. If I query "select * from tablename where brand='BrandB' limit 3;", everything works fine.

So is this really caused by the Hive schema mismatching the Parquet file generated by Spark, by the data within the different partition keys, or by a compatibility issue between Spark and Hive?

Thanks
Yong

--
Best Regards,
Ayan Guha
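For context on the suggestions above, here is a minimal, self-contained sketch of both: Ayan's setting, and the two write paths Angel asks about. It uses the Spark 1.6-era API, and the table name, column names, and warehouse path are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ParquetCompatSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("parquet-compat")
        // Ayan's setting: make Spark fall back to Hive's own Parquet SerDe for
        // metastore tables instead of its built-in Parquet support.
        conf.set("spark.sql.hive.convertMetastoreParquet", "false")
        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)
        import hiveContext.implicits._

        // Toy stand-in for the real ETL output; the columns are hypothetical.
        val df = sc.parallelize(Seq((1L, "x", "BrandA"), (2L, "y", "BrandB")))
          .toDF("id", "value", "brand")

        // Path 1 (what the thread describes): write the Parquet files directly,
        // then register the new partition folder with the Hive metastore.
        df.write.mode("append").partitionBy("brand")
          .parquet("/apps/hive/warehouse/tablename")
        hiveContext.sql(
          "ALTER TABLE tablename ADD IF NOT EXISTS PARTITION (brand='BrandA')")

        // Path 2 (Angel's alternative): insert through the metastore so that
        // Hive controls the file layout and the type conversion.
        df.registerTempTable("staging")
        hiveContext.sql(
          "INSERT INTO TABLE tablename PARTITION (brand='BrandA') " +
          "SELECT id, value FROM staging WHERE brand = 'BrandA'")
      }
    }

Note that spark.sql.hive.convertMetastoreParquet only changes how Spark itself reads and writes metastore Parquet tables; reads from the Hive CLI still go through Hive's own SerDe, so a genuine type mismatch in the files still has to be fixed in the table definition or in the ETL, as the resolution at the top of the thread confirms.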