I believe you can get what you want by using HiveQL instead of the
pure programmatic API.  This is a little verbose, so perhaps a specialized
function would also be useful here.  I'm not sure I would call it
saveAsExternalTable, as there are also "external" Spark SQL data source
tables that have nothing to do with Hive.
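
For context, by "external" data source tables I mean something like what
sqlContext.createExternalTable gives you in 1.3 (if I'm remembering the 1.3
API right), where the metastore entry just points at data that lives outside
of it.  A rough, untested sketch with a placeholder path:

// registers a Spark SQL data source table over existing files
// (Parquet by default); the files are not managed by the metastore
sqlContext.createExternalTable("spark_test_foo_ds", "/path/to/parquet")

Note that this is still a data source table, so Hive itself won't know how
to read it directly.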

The following should create a proper Hive table:
df.registerTempTable("df")
sqlContext.sql("CREATE TABLE newTable AS SELECT * FROM df")

At the very least we should clarify this in the documentation to avoid
future confusion.  The piggybacking on Hive's metastore is a little
unfortunate, but it also gives us a lot of new functionality that we can't
get when strictly following the way that Hive expects tables to be
formatted.
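
For what it's worth, you can see what saveAsTable actually put into the
metastore from the Hive side.  Using the table name from your example
(the exact property layout may differ slightly by version):

-- from the Hive CLI
DESCRIBE FORMATTED spark_test_foo;
-- the Table Parameters section should include entries like
--   spark.sql.sources.provider   org.apache.spark.sql.parquet.DefaultSource
--   spark.sql.sources.schema     <the real Spark SQL schema>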

I'd suggest opening a JIRA for the specialized method you describe.  Feel
free to mention me and Yin in a comment when you create it.

On Fri, Mar 20, 2015 at 12:55 PM, Christian Perez <christ...@svds.com>
wrote:

> Any other users interested in a feature
> DataFrame.saveAsExternalTable() for making _useful_ external tables in
> Hive, or am I the only one? Bueller? If I start a PR for this, will it
> be taken seriously?
>
> On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez <christ...@svds.com>
> wrote:
> > Hi Yin,
> >
> > Thanks for the clarification. My first reaction is that if this is the
> > intended behavior, it is a wasted opportunity. Why create a managed
> > table in Hive that cannot be read from inside Hive? I think I
> > understand now that you are essentially piggybacking on Hive's
> > metastore to persist table info between/across sessions, but I imagine
> > others might expect more (as I have).
> >
> > We find ourselves wanting to do work in Spark and persist the results
> > where other users (e.g. analysts using Tableau connected to
> > Hive/Impala) can explore it. I imagine this is very common. I can, of
> > course, save it as Parquet and create an external table in Hive (which
> > I will do now), but saveAsTable seems much less useful to me now.
> >
> > Any other opinions?
> >
> > Cheers,
> >
> > C
> >
> > On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yh...@databricks.com> wrote:
> >> I meant that table properties and serde properties are used to store
> >> metadata of a Spark SQL data source table. We do not set other fields
> >> like the SerDe lib. For a user, the output of DESCRIBE
> >> EXTENDED/FORMATTED on a data source table should not show unrelated
> >> stuff like SerDe lib and InputFormat. I have created
> >> https://issues.apache.org/jira/browse/SPARK-6413 to track the
> >> improvement on the output of the DESCRIBE statement.
> >>
> >> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
> >>>
> >>> Hi Christian,
> >>>
> >>> Your table is stored correctly in Parquet format.
> >>>
> >>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
> >>> data source table
> >>> (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
> >>> We are only using Hive's metastore to store the metadata (to be
> >>> specific, only table properties and serde properties). When you look
> >>> at the table properties, there will be a field called
> >>> "spark.sql.sources.provider" and the value will be
> >>> "org.apache.spark.sql.parquet.DefaultSource". You can also look at
> >>> your files in the file system; they are stored in Parquet format.
> >>>
> >>> Thanks,
> >>>
> >>> Yin
> >>>
> >>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
> >>> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
> >>>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
> >>>> schema _and_ storage format in the Hive metastore, so that the table
> >>>> cannot be read from inside Hive. Spark itself can read the table, but
> >>>> Hive throws a Serialization error because it doesn't know it is
> >>>> Parquet.
> >>>>
> >>>> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education",
> >>>> "income")
> >>>> df.saveAsTable("spark_test_foo")
> >>>>
> >>>> Expected:
> >>>>
> >>>> COLUMNS(
> >>>>   education BIGINT,
> >>>>   income BIGINT
> >>>> )
> >>>>
> >>>> SerDe Library:
> >>>> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> >>>> InputFormat:
> >>>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> >>>>
> >>>> Actual:
> >>>>
> >>>> COLUMNS(
> >>>>   col array<string> COMMENT "from deserializer"
> >>>> )
> >>>>
> >>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
> >>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
> >>>>
> >>>> ---
> >>>>
> >>>> Manually changing schema and storage restores access in Hive and
> >>>> doesn't affect Spark. Note also that Hive's table property
> >>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
> >>>> the schema data is serialized when sent to Hive but not deserialized
> >>>> properly on receive.
> >>>>
> >>>> I'm tracing execution through source code... but before I get any
> >>>> deeper, can anyone reproduce this behavior?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Christian
> >>>>
> >>>> --
> >>>> Christian Perez
> >>>> Silicon Valley Data Science
> >>>> Data Analyst
> >>>> christ...@svds.com
> >>>> @cp_phd
> >>>>
> >>>
> >>
> >
> >
> >
> > --
> > Christian Perez
> > Silicon Valley Data Science
> > Data Analyst
> > christ...@svds.com
> > @cp_phd
>
>
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
>
>
