I believe that you can get what you want by using HiveQL instead of the pure programmatic API. This is a little verbose, so perhaps a specialized function would also be useful here. I'm not sure I would call it saveAsExternalTable, as there are also "external" Spark SQL data source tables that have nothing to do with Hive.

The following should create a proper Hive table:

    df.registerTempTable("df")
    sqlContext.sql("CREATE TABLE newTable AS SELECT * FROM df")

At the very least we should clarify this in the documentation to avoid future confusion. The piggybacking is a little unfortunate, but it also gives us a lot of new functionality that we can't get when strictly following the way that Hive expects tables to be formatted.

I'd suggest opening a JIRA for the specialized method you describe. Feel free to mention me and Yin in a comment when you create it.
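As a rough illustration, here is a minimal sketch of what that specialized method could look like, simply wrapping the CTAS approach above. The name saveAsHiveTable and the temp-table naming are hypothetical, not an existing API, and it assumes df comes from a HiveContext (a plain SQLContext does not support CREATE TABLE AS SELECT):

    import org.apache.spark.sql.DataFrame

    // Hypothetical helper: persist a DataFrame as a real Hive table by
    // routing through HiveQL CTAS instead of the data source API.
    def saveAsHiveTable(df: DataFrame, tableName: String): Unit = {
      val tmp = s"${tableName}_ctas_src"  // throwaway temp-table name
      df.registerTempTable(tmp)
      try {
        df.sqlContext.sql(s"CREATE TABLE $tableName AS SELECT * FROM $tmp")
      } finally {
        df.sqlContext.dropTempTable(tmp)  // remove the temp registration
      }
    }

A real version would also need to validate or escape tableName; this is only a sketch.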
On Fri, Mar 20, 2015 at 12:55 PM, Christian Perez <christ...@svds.com> wrote:
> Any other users interested in a feature DataFrame.saveAsExternalTable()
> for making _useful_ external tables in Hive, or am I the only one?
> Bueller? If I start a PR for this, will it be taken seriously?
>
> On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez <christ...@svds.com> wrote:
> > Hi Yin,
> >
> > Thanks for the clarification. My first reaction is that if this is the
> > intended behavior, it is a wasted opportunity. Why create a managed
> > table in Hive that cannot be read from inside Hive? I think I
> > understand now that you are essentially piggybacking on Hive's
> > metastore to persist table info between/across sessions, but I imagine
> > others might expect more (as I have).
> >
> > We find ourselves wanting to do work in Spark and persist the results
> > where other users (e.g. analysts using Tableau connected to
> > Hive/Impala) can explore it. I imagine this is very common. I can, of
> > course, save it as parquet and create an external table in Hive (which
> > I will do now; see the sketch at the end of this thread), but
> > saveAsTable seems much less useful to me now.
> >
> > Any other opinions?
> >
> > Cheers,
> >
> > C
> >
> > On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yh...@databricks.com> wrote:
> >> I meant that table properties and serde properties are used to store
> >> metadata of a Spark SQL data source table. We do not set other fields
> >> like SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED
> >> on a data source table should not show unrelated stuff like SerDe lib
> >> and InputFormat. I have created
> >> https://issues.apache.org/jira/browse/SPARK-6413 to track the
> >> improvement of the output of the DESCRIBE statement.
> >>
> >> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
> >>>
> >>> Hi Christian,
> >>>
> >>> Your table is stored correctly in Parquet format.
> >>>
> >>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
> >>> data source table
> >>> (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
> >>> We are only using Hive's metastore to store the metadata (to be specific,
> >>> only table properties and serde properties). When you look at the table
> >>> properties, there will be a field called "spark.sql.sources.provider" and
> >>> the value will be "org.apache.spark.sql.parquet.DefaultSource". You can
> >>> also look at your files in the file system. They are stored by Parquet.
> >>>
> >>> Thanks,
> >>>
> >>> Yin
> >>>
> >>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
> >>> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
> >>>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
> >>>> schema _and_ storage format in the Hive metastore, so that the table
> >>>> cannot be read from inside Hive.
> >>>> Spark itself can read the table, but Hive throws a serialization
> >>>> error because it doesn't know it is Parquet.
> >>>>
> >>>> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education",
> >>>>   "income")
> >>>> df.saveAsTable("spark_test_foo")
> >>>>
> >>>> Expected:
> >>>>
> >>>> COLUMNS(
> >>>>   education BIGINT,
> >>>>   income BIGINT
> >>>> )
> >>>>
> >>>> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> >>>> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> >>>>
> >>>> Actual:
> >>>>
> >>>> COLUMNS(
> >>>>   col array<string> COMMENT "from deserializer"
> >>>> )
> >>>>
> >>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
> >>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
> >>>>
> >>>> ---
> >>>>
> >>>> Manually changing the schema and storage format restores access in
> >>>> Hive and doesn't affect Spark. Note also that Hive's table property
> >>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
> >>>> the schema data is serialized when sent to Hive but not deserialized
> >>>> properly on receive.
> >>>>
> >>>> I'm tracing execution through the source code... but before I get any
> >>>> deeper, can anyone reproduce this behavior?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Christian
> >>>>
> >>>> --
> >>>> Christian Perez
> >>>> Silicon Valley Data Science
> >>>> Data Analyst
> >>>> christ...@svds.com
> >>>> @cp_phd
> >
> > --
> > Christian Perez
> > Silicon Valley Data Science
> > Data Analyst
> > christ...@svds.com
> > @cp_phd
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
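For completeness, a minimal sketch of the workaround Christian mentions above (save to Parquet, then point an external Hive table at the files). It assumes sqlContext is a HiveContext; the path is a placeholder, the column types match the Int tuples in his example, and the exact DDL can vary by Hive version:

    // Write the DataFrame out as Parquet files (Spark 1.3 API).
    df.saveAsParquetFile("hdfs:///tmp/spark_test_foo_parquet")

    // Register an external Hive table over those files so Hive/Impala
    // users can query the results; dropping the table leaves the data.
    sqlContext.sql("""
      CREATE EXTERNAL TABLE spark_test_foo_ext (education INT, income INT)
      STORED AS PARQUET
      LOCATION 'hdfs:///tmp/spark_test_foo_parquet'
    """)

Unlike the saveAsTable path discussed in this thread, a table created this way carries the real Parquet SerDe and input format, so it is readable from inside Hive.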