Hi Christian,

Your table is stored correctly in Parquet format.

With saveAsTable, the table that gets created is *not* a Hive table but a
Spark SQL data source table (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
We use Hive's metastore only to store the metadata (to be specific, only
the table properties and serde properties), so the column list and storage
format that Hive itself reports are not what Spark uses to read the table.
If you look at the table properties, you will see a field called
"spark.sql.sources.provider" whose value is
"org.apache.spark.sql.parquet.DefaultSource". You can also look at your
files in the file system: they are stored in Parquet format.
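
As a rough sketch of how to check this from spark-shell (assuming Spark 1.3
built with Hive support, so that `sqlContext` is a HiveContext, and assuming
the table name `spark_test_foo` from your example), something like the
following should show the provider property and the schema Spark sees:

```scala
// Run inside spark-shell; `sqlContext` is assumed to be a HiveContext.

// DESCRIBE FORMATTED is handed to Hive as a native command. Its output
// should include the table properties, among them
// spark.sql.sources.provider.
sqlContext.sql("DESCRIBE FORMATTED spark_test_foo").collect().foreach(println)

// Spark resolves the table through that provider, so the schema printed
// here comes from the data source metadata, not from the column list
// that Hive displays.
val t = sqlContext.table("spark_test_foo")
t.printSchema()
t.show()
```

The exact layout of the DESCRIBE FORMATTED output depends on your Hive
version, so treat this as a way to look around, not as a fixed recipe.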

Thanks,

Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez <christ...@svds.com>
wrote:

> Hi all,
>
> DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
> CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
> schema _and_ storage format in the Hive metastore, so that the table
> cannot be read from inside Hive. Spark itself can read the table, but
> Hive throws a Serialization error because it doesn't know it is
> Parquet.
>
> val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
> df.saveAsTable("spark_test_foo")
>
> Expected:
>
> COLUMNS(
>   education BIGINT,
>   income BIGINT
> )
>
> SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>
> Actual:
>
> COLUMNS(
>   col array<string> COMMENT "from deserializer"
> )
>
> SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
> InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
>
> ---
>
> Manually changing schema and storage restores access in Hive and
> doesn't affect Spark. Note also that Hive's table property
> "spark.sql.sources.schema" is correct. At first glance, it looks like
> the schema data is serialized when sent to Hive but not deserialized
> properly on receive.
>
> I'm tracing execution through source code... but before I get any
> deeper, can anyone reproduce this behavior?
>
> Cheers,
>
> Christian
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
