Hi Yin,

Thanks for the clarification. My first reaction is that if this is the intended behavior, it is a wasted opportunity. Why create a managed table in Hive that cannot be read from inside Hive? I understand now that you are essentially piggybacking on Hive's metastore to persist table info across sessions, but I imagine others might expect more (as I did).
We find ourselves wanting to do work in Spark and persist the results where other users (e.g. analysts using Tableau connected to Hive/Impala) can explore them. I imagine this is very common. I can, of course, save the data as Parquet and create an external table in Hive (which I will do now), but saveAsTable now seems much less useful to me. Any other opinions?

Cheers,

C

On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yh...@databricks.com> wrote:
> I meant that table properties and serde properties are used to store the
> metadata of a Spark SQL data source table. We do not set other fields like
> the SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a
> data source table should not show unrelated fields like SerDe lib and
> InputFormat. I have created
> https://issues.apache.org/jira/browse/SPARK-6413 to track the improvement
> of the output of the DESCRIBE statement.
>
> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
>>
>> Hi Christian,
>>
>> Your table is stored correctly in Parquet format.
>>
>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
>> data source table
>> (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
>> We are only using Hive's metastore to store the metadata (to be specific,
>> only the table properties and serde properties). If you look at the table
>> properties, there will be a field called "spark.sql.sources.provider"
>> whose value is "org.apache.spark.sql.parquet.DefaultSource". You can also
>> look at your files in the file system; they are stored by Parquet.
>>
>> Thanks,
>>
>> Yin
>>
>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>>> CDH5.3.2) in both spark-shell and pyspark, but records the *wrong*
>>> schema _and_ storage format in the Hive metastore, so that the table
>>> cannot be read from inside Hive.
>>> Spark itself can read the table, but Hive throws a serialization
>>> error because it does not know the data is Parquet.
>>>
>>> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education",
>>> "income")
>>> df.saveAsTable("spark_test_foo")
>>>
>>> Expected:
>>>
>>> COLUMNS(
>>>   education BIGINT,
>>>   income BIGINT
>>> )
>>>
>>> SerDe Library:
>>> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>> InputFormat:
>>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>>
>>> Actual:
>>>
>>> COLUMNS(
>>>   col array<string> COMMENT "from deserializer"
>>> )
>>>
>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>
>>> ---
>>>
>>> Manually changing the schema and storage format restores access in Hive
>>> and does not affect Spark. Note also that Hive's table property
>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>>> the schema data is serialized when sent to Hive but not deserialized
>>> properly on receipt.
>>>
>>> I'm tracing execution through the source code... but before I get any
>>> deeper, can anyone reproduce this behavior?
>>>
>>> Cheers,
>>>
>>> Christian
>>>
>>> --
>>> Christian Perez
>>> Silicon Valley Data Science
>>> Data Analyst
>>> christ...@svds.com
>>> @cp_phd
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>

--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd
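The workaround mentioned above (writing the data as Parquet yourself and pointing a Hive external table at the files) can be sketched roughly as follows for a Spark 1.3 shell against Hive 0.13. This is an untested sketch; the path and table name are placeholders, not anything from the thread:

```scala
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
import hc.implicits._

val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")

// Write the files in Parquet format at a location we control,
// instead of going through saveAsTable.
df.saveAsParquetFile("/user/hive/warehouse/spark_test_foo_ext")

// Declare an external Hive table over those files. Hive records the
// correct Parquet SerDe and InputFormat, so it can read the data, and
// dropping the table later leaves the files in place.
hc.sql("""
  CREATE EXTERNAL TABLE spark_test_foo_ext (education BIGINT, income BIGINT)
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/spark_test_foo_ext'
""")
```

Since the table is external and its metadata is written by Hive DDL rather than by Spark's data source layer, both Hive and Spark (and anything speaking to the metastore, e.g. Impala) should see the same schema and storage format.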