Any other users interested in a feature
DataFrame.saveAsExternalTable() for making _useful_ external tables in
Hive, or am I the only one? Bueller? If I start a PR for this, will it
be taken seriously?
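
For anyone landing here with the same problem, the manual workaround mentioned below (write the data as Parquet, then create an external table over it in Hive) could be sketched roughly as follows. This is only an illustration, not an existing Spark API: the helper name external_parquet_ddl and the /tmp/spark_test_foo path are made up, and the column types are copied from the "Expected" schema quoted below, so they would have to match what was actually written. It assumes the files were already written from Spark (e.g. with saveAsParquetFile in 1.3) and that Hive is 0.13 or later, where STORED AS PARQUET is available.

```python
# Sketch only: builds the Hive DDL for an external table over Parquet files.
# The helper name, table name, and path are hypothetical, for illustration.

def external_parquet_ddl(table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement for Parquet files at `location`.

    `columns` is a list of (name, hive_type) pairs.
    """
    cols = ",\n".join(
        "  {} {}".format(name, hive_type) for name, hive_type in columns
    )
    return (
        "CREATE EXTERNAL TABLE {} (\n{}\n)\n"
        "STORED AS PARQUET\n"
        "LOCATION '{}'"
    ).format(table, cols, location)


ddl = external_parquet_ddl(
    "spark_test_foo",
    [("education", "BIGINT"), ("income", "BIGINT")],
    "/tmp/spark_test_foo",  # hypothetical directory holding the Parquet files
)
print(ddl)
```

The resulting statement would then be run in Hive (or through a HiveContext) by hand, which is exactly the extra step a saveAsExternalTable() could take care of.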

On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez <> wrote:
> Hi Yin,
> Thanks for the clarification. My first reaction is that if this is the
> intended behavior, it is a wasted opportunity. Why create a managed
> table in Hive that cannot be read from inside Hive? I think I
> understand now that you are essentially piggybacking on Hive's
> metastore to persist table info between/across sessions, but I imagine
> others might expect more (as I have).
> We find ourselves wanting to do work in Spark and persist the results
> where other users (e.g. analysts using Tableau connected to
> Hive/Impala) can explore it. I imagine this is very common. I can, of
> course, save it as Parquet and create an external table in Hive (which
> I will do now), but saveAsTable seems much less useful to me now.
> Any other opinions?
> Cheers,
> C
> On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <> wrote:
>> I meant table properties and serde properties are used to store metadata of
>> a Spark SQL data source table. We do not set other fields like SerDe lib.
>> For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table
>> should not show unrelated stuff like Serde lib and InputFormat. I have
>> created to track the
>> improvement on the output of DESCRIBE statement.
>> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <> wrote:
>>> Hi Christian,
>>> Your table is stored correctly in Parquet format.
>>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
>>> data source table
>>> (
>>> We are only using Hive's metastore to store the metadata (to be specific,
>>> only table properties and serde properties). When you look at the table
>>> properties, there will be a field called "spark.sql.sources.provider" whose
>>> value is "org.apache.spark.sql.parquet.DefaultSource". You can also
>>> look at your files in the file system. They are stored as Parquet.
>>> Thanks,
>>> Yin
>>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <>
>>> wrote:
>>>> Hi all,
>>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>>>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
>>>> schema _and_ storage format in the Hive metastore, so that the table
>>>> cannot be read from inside Hive. Spark itself can read the table, but
>>>> Hive throws a Serialization error because it doesn't know it is
>>>> Parquet.
>>>> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education",
>>>> "income")
>>>> df.saveAsTable("spark_test_foo")
>>>> Expected:
>>>>   education BIGINT,
>>>>   income BIGINT
>>>> )
>>>> SerDe Library:
>>>> InputFormat:
>>>> Actual:
>>>>   col array<string> COMMENT "from deserializer"
>>>> )
>>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>> ---
>>>> Manually changing schema and storage restores access in Hive and
>>>> doesn't affect Spark. Note also that Hive's table property
>>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>>>> the schema data is serialized when sent to Hive but not deserialized
>>>> properly on receive.
>>>> I'm tracing execution through source code... but before I get any
>>>> deeper, can anyone reproduce this behavior?
>>>> Cheers,
>>>> Christian
>>>> --
>>>> Christian Perez
>>>> Silicon Valley Data Science
>>>> Data Analyst
>>>> @cp_phd
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail:
>>>> For additional commands, e-mail:

Christian Perez
Silicon Valley Data Science
Data Analyst
