Hi all,

DataFrame.saveAsTable creates a managed table in Hive (v0.13 on CDH5.3.2) in both spark-shell and pyspark, but it records the *wrong* schema _and_ storage format in the Hive metastore, so the table cannot be read from inside Hive. Spark itself can still read the table, but Hive throws a serialization error because it doesn't know the underlying data is Parquet.
A minimal repro in spark-shell:

    val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
    df.saveAsTable("spark_test_foo")

Expected:

    COLUMNS( education BIGINT, income BIGINT )
    SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
    InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

Actual:

    COLUMNS( col array<string> COMMENT "from deserializer" )
    SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
    InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

---

Manually changing the schema and storage format restores access in Hive and doesn't affect Spark. Note also that Hive's table property "spark.sql.sources.schema" is correct. At first glance, it looks like the schema data is serialized when sent to Hive but not deserialized properly on receive. I'm tracing execution through the source code... but before I get any deeper, can anyone reproduce this behavior?

Cheers,

Christian

--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
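For reference, the manual change amounts to roughly the following Hive DDL. This is a sketch based on the example table above (table name "spark_test_foo", BIGINT columns), so adjust names and types to match your own table:

    -- Fix the column metadata first, while the table still has the native
    -- MetadataTypedColumnsetSerDe (REPLACE COLUMNS only works with native SerDes):
    ALTER TABLE spark_test_foo REPLACE COLUMNS (education BIGINT, income BIGINT);

    -- Then point the storage metadata at Parquet; on Hive 0.13+ this sets the
    -- SerDe, InputFormat, and OutputFormat in one step:
    ALTER TABLE spark_test_foo SET FILEFORMAT PARQUET;

After this, DESCRIBE FORMATTED spark_test_foo should show the expected Parquet SerDe and InputFormat listed above.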