Are any other users interested in a feature like
DataFrame.saveAsExternalTable() for making _useful_ external tables in
Hive, or am I the only one? Bueller? If I start a PR for this, will it
be taken seriously?
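
For concreteness, a sketch of what I have in mind (the method name and
signature here are hypothetical; nothing like this exists in Spark today):

// Hypothetical API: write df out in the given format under `path`, then
// register an EXTERNAL table in the metastore with the real schema and
// matching SerDe/InputFormat, so Hive and Impala can read it directly.
df.saveAsExternalTable("spark_test_foo", source = "parquet",
  path = "/user/spark/spark_test_foo")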

On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez <christ...@svds.com> wrote:
> Hi Yin,
>
> Thanks for the clarification. My first reaction is that if this is the
> intended behavior, it is a wasted opportunity. Why create a managed
> table in Hive that cannot be read from inside Hive? I think I
> understand now that you are essentially piggybacking on Hive's
> metastore to persist table info between/across sessions, but I imagine
> others might expect more (as I have).
>
> We find ourselves wanting to do work in Spark and persist the results
> where other users (e.g. analysts using Tableau connected to
> Hive/Impala) can explore them. I imagine this is very common. I can, of
> course, save the data as Parquet and create an external table in Hive
> (which I will do now), but saveAsTable seems much less useful to me now.
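>
> For the record, here is a sketch of that workaround against the Spark
> 1.3 API (paths and names are illustrative, and hiveContext stands in
> for a HiveContext; in a Hive-enabled spark-shell, sqlContext is one):
>
> // Write the data out as Parquet files.
> df.saveAsParquetFile("hdfs:///user/spark/spark_test_foo")
>
> // Point an external table at those files. With a HiveContext the DDL
> // is passed through to Hive; STORED AS PARQUET needs Hive 0.13+.
> hiveContext.sql("""
>   CREATE EXTERNAL TABLE spark_test_foo_ext (education INT, income INT)
>   STORED AS PARQUET
>   LOCATION 'hdfs:///user/spark/spark_test_foo'
> """)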
>
> Any other opinions?
>
> Cheers,
>
> C
>
> On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai <yh...@databricks.com> wrote:
>> I meant that table properties and SerDe properties are used to store the
>> metadata of a Spark SQL data source table. We do not set other fields like
>> the SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a
>> data source table should not show unrelated fields like SerDe lib and
>> InputFormat. I have created
>> https://issues.apache.org/jira/browse/SPARK-6413 to track the
>> improvement of the output of the DESCRIBE statement.
>>
>> On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
>>>
>>> Hi Christian,
>>>
>>> Your table is stored correctly in Parquet format.
>>>
>>> For saveAsTable, the table created is not a Hive table, but a Spark SQL
>>> data source table
>>> (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
>>> We are only using Hive's metastore to store the metadata (to be specific,
>>> only table properties and SerDe properties). If you look at the table
>>> properties, there will be a property called "spark.sql.sources.provider"
>>> whose value is "org.apache.spark.sql.parquet.DefaultSource". You can also
>>> look at your files in the file system; they are stored in Parquet format.
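>>>
>>> For example (illustrative), you can check that property through a
>>> HiveContext; the same SHOW TBLPROPERTIES statement also works in the
>>> Hive CLI:
>>>
>>> // The statement is passed through to Hive's native command path.
>>> hiveContext.sql("SHOW TBLPROPERTIES spark_test_foo").show()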
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>>>> CDH5.3.2) in both spark-shell and pyspark, but records the *wrong*
>>>> schema _and_ storage format in the Hive metastore, so that the table
>>>> cannot be read from inside Hive. Spark itself can read the table, but
>>>> Hive throws a serialization error because it doesn't know the data is
>>>> Parquet.
>>>>
>>>> // In spark-shell; toDF comes from the sqlContext implicits.
>>>> import sqlContext.implicits._
>>>> val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")
>>>> df.saveAsTable("spark_test_foo")
>>>>
>>>> Expected:
>>>>
>>>> COLUMNS(
>>>>   education BIGINT,
>>>>   income BIGINT
>>>> )
>>>>
>>>> SerDe Library:
>>>> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>>>> InputFormat:
>>>> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>>>
>>>> Actual:
>>>>
>>>> COLUMNS(
>>>>   col array<string> COMMENT "from deserializer"
>>>> )
>>>>
>>>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>>>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>>>
>>>> ---
>>>>
>>>> Manually changing the schema and storage format restores access in
>>>> Hive and doesn't affect Spark. Note also that Hive's table property
>>>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>>>> the schema data is serialized when sent to Hive but not deserialized
>>>> properly on the receiving end.
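>>>>
>>>> For reference, this is roughly the manual fix (a sketch only; adjust
>>>> column names/types to your schema, and the same statements can be run
>>>> in the hive shell instead of through a HiveContext):
>>>>
>>>> // REPLACE COLUMNS first, while the table still has a native SerDe;
>>>> // then switch the storage format (SET FILEFORMAT PARQUET needs Hive
>>>> // 0.13+; if the SerDe is still wrong afterwards, SET SERDE fixes it).
>>>> hiveContext.sql(
>>>>   "ALTER TABLE spark_test_foo REPLACE COLUMNS (education INT, income INT)")
>>>> hiveContext.sql("ALTER TABLE spark_test_foo SET FILEFORMAT PARQUET")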
>>>>
>>>> I'm tracing execution through source code... but before I get any
>>>> deeper, can anyone reproduce this behavior?
>>>>
>>>> Cheers,
>>>>
>>>> Christian
>>>>
>>>
>>
>
>
>



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
