Are any other users interested in a DataFrame.saveAsExternalTable() feature for creating _useful_ external tables in Hive, or am I the only one? Bueller? If I start a PR for this, will it be taken seriously?
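For the curious, the workaround I mention below amounts to something like the following. This is only a rough sketch against the 1.3 API in spark-shell (the HDFS path and table name are placeholders), and it is roughly what I would hope saveAsExternalTable() could do in a single call:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // in spark-shell, sqlContext is already a HiveContext
import hiveContext.implicits._

val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")

// 1. Write plain Parquet files to a location we control.
df.saveAsParquetFile("hdfs:///tmp/spark_test_foo")

// 2. Point an external Hive table at those files. The column types must match
//    what Spark actually wrote (INT here, since the Scala values are Ints).
hiveContext.sql("""
  CREATE EXTERNAL TABLE spark_test_foo (education INT, income INT)
  STORED AS PARQUET
  LOCATION 'hdfs:///tmp/spark_test_foo'
""")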
On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez <christ...@svds.com> wrote:
> Hi Yin,
>
> Thanks for the clarification. My first reaction is that if this is the
> intended behavior, it is a wasted opportunity. Why create a managed
> table in Hive that cannot be read from inside Hive? I understand now
> that you are essentially piggybacking on Hive's metastore to persist
> table info between/across sessions, but I imagine others might expect
> more (as I did).
>
> We find ourselves wanting to do work in Spark and persist the results
> where other users (e.g. analysts using Tableau connected to
> Hive/Impala) can explore them. I imagine this is very common. I can, of
> course, save the data as Parquet and create an external table in Hive
> (which I will do now), but saveAsTable seems much less useful to me now.
>
> Any other opinions?
>
> Cheers,
>
> C
>
> On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yh...@databricks.com> wrote:
>> I meant that table properties and SerDe properties are used to store the
>> metadata of a Spark SQL data source table. We do not set other fields
>> like the SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED
>> on a data source table should not show unrelated fields like SerDe lib
>> and InputFormat. I have created
>> https://issues.apache.org/jira/browse/SPARK-6413 to track improving the
>> output of the DESCRIBE statement.
>>
>> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>> Hi Christian,
>>>
>>> Your table is stored correctly in Parquet format.
>>>
>>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
>>> data source table
>>> (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
>>> We are only using Hive's metastore to store the metadata (to be
>>> specific, only table properties and SerDe properties). If you look at
>>> the table properties, there will be a field called
>>> "spark.sql.sources.provider" whose value is
>>> "org.apache.spark.sql.parquet.DefaultSource". You can also look at the
>>> files in the file system; they are stored in Parquet format.
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>>>> CDH 5.3.2) in both spark-shell and pyspark, but it records the *wrong*
>>>> schema _and_ storage format in the Hive metastore, so the table
>>>> cannot be read from inside Hive. Spark itself can read the table, but
>>>> Hive throws a serialization error because it doesn't know the data is
>>>> Parquet.
>>>>
>>>> val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")
>>>> df.saveAsTable("spark_test_foo")
>>>>
>>>> Expected:
>>>>
>>>> COLUMNS(
>>>>   education BIGINT,
>>>>   income BIGINT
>>>> )
>>>>
>>>> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>>> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>>>
>>>> Actual:
>>>>
>>>> COLUMNS(
>>>>   col array<string> COMMENT "from deserializer"
>>>> )
>>>>
>>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>>
>>>> ---
>>>>
>>>> Manually changing the schema and storage format restores access in
>>>> Hive and doesn't affect Spark. Note also that Hive's table property
>>>> "spark.sql.sources.schema" is correct.
>>>> At first glance, it looks like the schema data is serialized when it
>>>> is sent to Hive but not deserialized properly on receipt.
>>>>
>>>> I'm tracing execution through the source code... but before I get any
>>>> deeper, can anyone reproduce this behavior?
>>>>
>>>> Cheers,
>>>>
>>>> Christian

--
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org