I meant that table properties and serde properties are used to store the metadata of a Spark SQL data source table. We do not set other fields like the SerDe library. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table should not show unrelated fields like the SerDe library and InputFormat. I have created https://issues.apache.org/jira/browse/SPARK-6413 to track improving the output of the DESCRIBE statement.
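
If you want to verify what is actually stored for such a table, one quick check is to look at its table properties from the Hive CLI. A sketch (the exact output layout varies by Hive version, and the schema JSON is abbreviated here):

    hive> SHOW TBLPROPERTIES spark_test_foo;
    spark.sql.sources.provider    org.apache.spark.sql.parquet.DefaultSource
    spark.sql.sources.schema      {"type":"struct","fields":[...]}

The provider property is what Spark SQL reads back to decide how to interpret the table, which is why Spark can query it even though the Hive-level SerDe and InputFormat fields are only placeholders.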
On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai <yh...@databricks.com> wrote:
> Hi Christian,
>
> Your table is stored correctly in Parquet format.
>
> For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
> data source table (
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
> We are only using Hive's metastore to store the metadata (to be specific,
> only table properties and serde properties). When you look at the table
> properties, there will be a field called "spark.sql.sources.provider" whose
> value will be "org.apache.spark.sql.parquet.DefaultSource". You can also
> look at your files in the file system; they are stored as Parquet.
>
> Thanks,
>
> Yin
>
> On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
> wrote:
>
>> Hi all,
>>
>> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
>> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
>> schema _and_ storage format in the Hive metastore, so that the table
>> cannot be read from inside Hive. Spark itself can read the table, but
>> Hive throws a serialization error because it doesn't know the table is
>> stored as Parquet.
>>
>> val df = sc.parallelize(Array((1, 2), (3, 4))).toDF("education", "income")
>> df.saveAsTable("spark_test_foo")
>>
>> Expected:
>>
>> COLUMNS(
>>   education BIGINT,
>>   income BIGINT
>> )
>>
>> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>>
>> Actual:
>>
>> COLUMNS(
>>   col array<string> COMMENT "from deserializer"
>> )
>>
>> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
>> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>>
>> ---
>>
>> Manually changing the schema and storage format restores access in Hive
>> and doesn't affect Spark. Note also that Hive's table property
>> "spark.sql.sources.schema" is correct. At first glance, it looks like
>> the schema data is serialized when sent to Hive but not deserialized
>> properly on receive.
>>
>> I'm tracing execution through the source code... but before I get any
>> deeper, can anyone reproduce this behavior?
>>
>> Cheers,
>>
>> Christian
>>
>> --
>> Christian Perez
>> Silicon Valley Data Science
>> Data Analyst
>> christ...@svds.com
>> @cp_phd
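
Until SPARK-6413 is addressed, if you need Hive itself to read such a table, the manual fix Christian mentions might look something like the following from the Hive CLI. This is an untested sketch against his spark_test_foo example; it assumes Hive 0.13, where PARQUET is a recognized file format keyword, and the statement order matters as noted below:

    hive> ALTER TABLE spark_test_foo REPLACE COLUMNS (education BIGINT, income BIGINT);
    hive> ALTER TABLE spark_test_foo SET FILEFORMAT PARQUET;
    hive> ALTER TABLE spark_test_foo
        SET SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe';

REPLACE COLUMNS is run first because it only works while the table still uses a native SerDe such as MetadataTypedColumnsetSerDe; the explicit SET SERDE is a guard in case SET FILEFORMAT does not also update the SerDe on this Hive version. Spark should be unaffected either way, since it reads its own spark.sql.sources.* properties rather than these fields.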