Hi Christian,

Your table is stored correctly in Parquet format.

For saveAsTable, the table created is *not* a Hive table, but a Spark SQL data source table (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources). We are only using Hive's metastore to store the metadata (to be specific, only table properties and serde properties). When you look at the table properties, there will be a field called "spark.sql.sources.provider" whose value is "org.apache.spark.sql.parquet.DefaultSource". You can also look at your files in the file system; they are stored as Parquet.

Thanks,

Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com> wrote:
> Hi all,
>
> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
> schema _and_ storage format in the Hive metastore, so that the table
> cannot be read from inside Hive. Spark itself can read the table, but
> Hive throws a Serialization error because it doesn't know it is
> Parquet.
>
> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
> df.saveAsTable("spark_test_foo")
>
> Expected:
>
> COLUMNS(
>   education BIGINT,
>   income BIGINT
> )
>
> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>
> Actual:
>
> COLUMNS(
>   col array<string> COMMENT "from deserializer"
> )
>
> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>
> ---
>
> Manually changing schema and storage restores access in Hive and
> doesn't affect Spark. Note also that Hive's table property
> "spark.sql.sources.schema" is correct. At first glance, it looks like
> the schema data is serialized when sent to Hive but not deserialized
> properly on receive.
>
> I'm tracing execution through source code... but before I get any
> deeper, can anyone reproduce this behavior?
>
> Cheers,
>
> Christian
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
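
To make the check described at the top of this reply concrete, here is a minimal sketch in plain Scala. It only encodes the rule stated above: a table whose properties contain "spark.sql.sources.provider" is a Spark SQL data source table, so the SerDe and InputFormat that Hive reports for it are not what Spark uses to read the data. The object name `DataSourceTableCheck` and its methods are hypothetical helpers for illustration, not part of Spark's API, and the property map in `main` is sample input built from the value quoted in this thread.

```scala
// Sketch only: DataSourceTableCheck and its methods are hypothetical helpers,
// not part of Spark or Hive.
object DataSourceTableCheck {
  // Table property that Spark SQL sets on data source tables (named in this thread).
  val ProviderKey = "spark.sql.sources.provider"

  // Returns the provider class name if the property is present in the table properties.
  def provider(tableProps: Map[String, String]): Option[String] =
    tableProps.get(ProviderKey)

  // Summarizes how the table should be interpreted, based only on its properties.
  def describe(tableProps: Map[String, String]): String =
    provider(tableProps) match {
      case Some(p) => s"Spark SQL data source table, provider = $p"
      case None    => "no provider property: treat as a regular Hive table"
    }

  def main(args: Array[String]): Unit = {
    // Sample input: the provider value is the one quoted in this thread.
    val props = Map(ProviderKey -> "org.apache.spark.sql.parquet.DefaultSource")
    println(describe(props))
    println(describe(Map.empty[String, String]))
  }
}
```

In practice the property map would come from the metastore rather than being typed in. Assuming `sqlContext` in spark-shell is a HiveContext, running Hive's `SHOW TBLPROPERTIES spark_test_foo` through `sqlContext.sql(...)` is one way to list the properties, though I have not verified the exact output format on CDH5.3.2.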