I meant that table properties and serde properties are used to store the metadata of a Spark SQL data source table. We do not set other fields like the SerDe library. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table should not show unrelated fields like the SerDe library and InputFormat. I have created https://issues.apache.org/jira/browse/SPARK-6413 to track improving the output of the DESCRIBE statement.
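
If you want to verify what is actually stored for such a table, one quick check is to look at its table properties from the Hive CLI. A sketch (the exact output layout varies by Hive version, and the schema JSON is abbreviated here):

    hive> SHOW TBLPROPERTIES spark_test_foo;
    spark.sql.sources.provider    org.apache.spark.sql.parquet.DefaultSource
    spark.sql.sources.schema      {"type":"struct","fields":[...]}

The provider property is what Spark SQL reads back to decide how to interpret the table, which is why Spark can query it even though the Hive-level SerDe and InputFormat fields are only placeholders.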
On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
> Hi Christian,
>
> Your table is stored correctly in Parquet format.
>
> For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
> data source table (
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
> We are only using Hive's metastore to store the metadata (to be specific,
> only table properties and serde properties). When you look at the table
> properties, there will be a field called "spark.sql.sources.provider" whose
> value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
> look at your files in the file system; they are stored as Parquet.
>
> Thanks,
>
> Yin
>
> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
> wrote:
>
>> Hi all,
>>
>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
>> schema _and_ storage format in the Hive metastore, so that the table
>> cannot be read from inside Hive. Spark itself can read the table, but
>> Hive throws a serialization error because it doesn't know the table is
>> stored as Parquet.
>>
>> val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")
>> df.saveAsTable("spark_test_foo")
>>
>> Expected:
>>
>> COLUMNS(
>>   education BIGINT,
>>   income BIGINT
>> )
>>
>> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>
>> Actual:
>>
>> COLUMNS(
>>   col array<string> COMMENT "from deserializer"
>> )
>>
>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>
>> ---
>>
>> Manually changing the schema and storage format restores access in Hive
>> and doesn't affect Spark. Note also that Hive's table property
>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>> the schema data is serialized when sent to Hive but not deserialized
>> properly on receive.
>>
>> I'm tracing execution through the source code... but before I get any
>> deeper, can anyone reproduce this behavior?
>>
>> Cheers,
>>
>> Christian
>>
>> --
>> Christian Perez
>> Silicon Valley Data Science
>> Data Analyst
>> christ...@svds.com
>> @cp_phd
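
Until SPARK-6413 is addressed, if you need Hive itself to read such a table, the manual fix Christian mentions might look something like the following from the Hive CLI. This is an untested sketch against his spark_test_foo example; it assumes Hive 0.13, where PARQUET is a recognized file format keyword, and the statement order matters as noted below:

    hive> ALTER TABLE spark_test_foo REPLACE COLUMNS (education BIGINT, income BIGINT);
    hive> ALTER TABLE spark_test_foo SET FILEFORMAT PARQUET;
    hive> ALTER TABLE spark_test_foo
        SET SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';

REPLACE COLUMNS is run first because it only works while the table still uses a native SerDe such as MetadataTypedColumnsetSerDe; the explicit SET SERDE is a guard in case SET FILEFORMAT does not also update the SerDe on this Hive version. Spark should be unaffected either way, since it reads its own spark.sql.sources.* properties rather than these fields.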