Re: saveAsTable broken in v1.3 DataFrames?

2015-03-21 Thread Michael Armbrust
I believe that you can get what you want by using HiveQL instead of the
pure programmatic API.  This is a little verbose, so perhaps a specialized
function would also be useful here.  I'm not sure I would call it
saveAsExternalTable, as there are also external Spark SQL data source
tables that have nothing to do with Hive.

The following should create a proper hive table:
df.registerTempTable("df")
sqlContext.sql("CREATE TABLE newTable AS SELECT * FROM df")
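
For completeness, a slightly fuller, hedged sketch of the same approach from
spark-shell: it assumes sqlContext is a HiveContext, and the table name and
the STORED AS PARQUET clause are illustrative rather than a confirmed recipe
(without the clause you typically get Hive's default text format).

// Hedged sketch, not a tested recipe: register the DataFrame and create a
// plain Hive table from it in a format Hive/Impala can read directly.
df.registerTempTable("df")
sqlContext.sql("CREATE TABLE newTable STORED AS PARQUET AS SELECT * FROM df")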

At the very least we should clarify this in the documentation to avoid future
confusion.  The piggybacking is a little unfortunate, but it also gives us a
lot of new functionality that we can't get when strictly following the way
that Hive expects tables to be formatted.

I'd suggest opening a JIRA for the specialized method you describe.  Feel
free to mention me and Yin in a comment when you create it.

On Fri, Mar 20, 2015 at 12:55 PM, Christian Perez christ...@svds.com
wrote:

 Any other users interested in a feature
 DataFrame.saveAsExternalTable() for making _useful_ external tables in
 Hive, or am I the only one? Bueller? If I start a PR for this, will it
 be taken seriously?

 On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez christ...@svds.com
 wrote:
  Hi Yin,
 
  Thanks for the clarification. My first reaction is that if this is the
  intended behavior, it is a wasted opportunity. Why create a managed
  table in Hive that cannot be read from inside Hive? I think I
  understand now that you are essentially piggybacking on Hive's
  metastore to persist table info between/across sessions, but I imagine
  others might expect more (as I have.)
 
  We find ourselves wanting to do work in Spark and persist the results
  where other users (e.g. analysts using Tableau connected to
  Hive/Impala) can explore it. I imagine this is very common. I can, of
  course, save it as parquet and create an external table in hive (which
  I will do now), but saveAsTable seems much less useful to me now.
 
  Any other opinions?
 
  Cheers,
 
  C
 
  On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai yh...@databricks.com wrote:
  I meant table properties and serde properties are used to store
 metadata of
  a Spark SQL data source table. We do not set other fields like SerDe
 lib.
  For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source
 table
  should not show unrelated stuff like Serde lib and InputFormat. I have
  created https://issues.apache.org/jira/browse/SPARK-6413 to track the
  improvement on the output of DESCRIBE statement.
 
  On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com
 wrote:
 
  Hi Christian,
 
  Your table is stored correctly in Parquet format.
 
  For saveAsTable, the table created is not a Hive table, but a Spark SQL
  data source table
  (
 http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources
 ).
  We are only using Hive's metastore to store the metadata (to be
 specific,
  only table properties and serde properties). When you look at table
  property, there will be a field called spark.sql.sources.provider
 and the
  value will be org.apache.spark.sql.parquet.DefaultSource. You can
 also
  look at your files in the file system. They are stored by Parquet.
 
  Thanks,
 
  Yin
 
  On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
  wrote:
 
  Hi all,
 
  DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
  CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
  schema _and_ storage format in the Hive metastore, so that the table
  cannot be read from inside Hive. Spark itself can read the table, but
  Hive throws a Serialization error because it doesn't know it is
  Parquet.
 
  val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
  df.saveAsTable("spark_test_foo")
 
  Expected:
 
  COLUMNS(
education BIGINT,
income BIGINT
  )
 
  SerDe Library:
  org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
  InputFormat:
  org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
 
  Actual:
 
  COLUMNS(
    col array<string> COMMENT 'from deserializer'
  )
 
  SerDe Library:
 org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
  InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
 
  ---
 
  Manually changing schema and storage restores access in Hive and
  doesn't affect Spark. Note also that Hive's table property
  spark.sql.sources.schema is correct. At first glance, it looks like
  the schema data is serialized when sent to Hive but not deserialized
  properly on receive.
 
  I'm tracing execution through source code... but before I get any
  deeper, can anyone reproduce this behavior?
 
  Cheers,
 
  Christian
 
  --
  Christian Perez
  Silicon Valley Data Science
  Data Analyst
  christ...@svds.com
  @cp_phd
 
 
 
 
 
 
 

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-20 Thread Christian Perez
Any other users interested in a feature
DataFrame.saveAsExternalTable() for making _useful_ external tables in
Hive, or am I the only one? Bueller? If I start a PR for this, will it
be taken seriously?

On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez christ...@svds.com wrote:
 Hi Yin,

 Thanks for the clarification. My first reaction is that if this is the
 intended behavior, it is a wasted opportunity. Why create a managed
 table in Hive that cannot be read from inside Hive? I think I
 understand now that you are essentially piggybacking on Hive's
 metastore to persist table info between/across sessions, but I imagine
 others might expect more (as I have.)

 We find ourselves wanting to do work in Spark and persist the results
 where other users (e.g. analysts using Tableau connected to
 Hive/Impala) can explore it. I imagine this is very common. I can, of
 course, save it as parquet and create an external table in hive (which
 I will do now), but saveAsTable seems much less useful to me now.

 Any other opinions?

 Cheers,

 C

 On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai yh...@databricks.com wrote:
 I meant table properties and serde properties are used to store metadata of
 a Spark SQL data source table. We do not set other fields like SerDe lib.
 For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table
 should not show unrelated stuff like Serde lib and InputFormat. I have
 created https://issues.apache.org/jira/browse/SPARK-6413 to track the
 improvement on the output of DESCRIBE statement.

 On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com wrote:

 Hi Christian,

 Your table is stored correctly in Parquet format.

 For saveAsTable, the table created is not a Hive table, but a Spark SQL
 data source table
 (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
 We are only using Hive's metastore to store the metadata (to be specific,
 only table properties and serde properties). When you look at table
 property, there will be a field called spark.sql.sources.provider and the
 value will be org.apache.spark.sql.parquet.DefaultSource. You can also
 look at your files in the file system. They are stored by Parquet.

 Thanks,

 Yin

 On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
 wrote:

 Hi all,

 DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
 CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
 schema _and_ storage format in the Hive metastore, so that the table
 cannot be read from inside Hive. Spark itself can read the table, but
 Hive throws a Serialization error because it doesn't know it is
 Parquet.

 val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
 df.saveAsTable("spark_test_foo")

 Expected:

 COLUMNS(
   education BIGINT,
   income BIGINT
 )

 SerDe Library:
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
 InputFormat:
 org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

 Actual:

 COLUMNS(
   col array<string> COMMENT 'from deserializer'
 )

 SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
 InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

 ---

 Manually changing schema and storage restores access in Hive and
 doesn't affect Spark. Note also that Hive's table property
 spark.sql.sources.schema is correct. At first glance, it looks like
 the schema data is serialized when sent to Hive but not deserialized
 properly on receive.

 I'm tracing execution through source code... but before I get any
 deeper, can anyone reproduce this behavior?

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd







 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd



-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
I meant table properties and serde properties are used to store metadata of
a Spark SQL data source table. We do not set other fields like SerDe lib.
For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source
table should not show unrelated stuff like Serde lib and InputFormat. I
have created https://issues.apache.org/jira/browse/SPARK-6413 to track the
improvement to the output of the DESCRIBE statement.

On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com wrote:

 Hi Christian,

 Your table is stored correctly in Parquet format.

 For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
 data source table (
 http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
 We are only using Hive's metastore to store the metadata (to be specific,
 only table properties and serde properties). When you look at table
 property, there will be a field called spark.sql.sources.provider and the
 value will be org.apache.spark.sql.parquet.DefaultSource. You can also
 look at your files in the file system. They are stored by Parquet.

 Thanks,

 Yin

 On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
 wrote:

 Hi all,

 DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
 CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
 schema _and_ storage format in the Hive metastore, so that the table
 cannot be read from inside Hive. Spark itself can read the table, but
 Hive throws a Serialization error because it doesn't know it is
 Parquet.

 val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
 df.saveAsTable("spark_test_foo")

 Expected:

 COLUMNS(
   education BIGINT,
   income BIGINT
 )

 SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
 InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

 Actual:

 COLUMNS(
   col array<string> COMMENT 'from deserializer'
 )

 SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
 InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

 ---

 Manually changing schema and storage restores access in Hive and
 doesn't affect Spark. Note also that Hive's table property
 spark.sql.sources.schema is correct. At first glance, it looks like
 the schema data is serialized when sent to Hive but not deserialized
 properly on receive.

 I'm tracing execution through source code... but before I get any
 deeper, can anyone reproduce this behavior?

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd






Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian,

Your table is stored correctly in Parquet format.

For saveAsTable, the table created is *not* a Hive table, but a Spark SQL
data source table (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
We are only using Hive's metastore to store the metadata (to be specific,
only table properties and SerDe properties). If you look at the table
properties, you will see a field called spark.sql.sources.provider whose
value is org.apache.spark.sql.parquet.DefaultSource. You can also look at
your files in the file system; they are stored by Parquet.
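
One hedged way to see that provider property from the same shell, assuming
SHOW TBLPROPERTIES is handed through to Hive by the 1.3 HiveContext (the
table name comes from the report quoted below):

// Illustrative only: list the metadata Spark SQL stored for the table.
sqlContext.sql("SHOW TBLPROPERTIES spark_test_foo").collect().foreach(println)
// Expect an entry for spark.sql.sources.provider pointing at
// org.apache.spark.sql.parquet.DefaultSource.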

Thanks,

Yin

On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
wrote:

 Hi all,

 DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
 CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
 schema _and_ storage format in the Hive metastore, so that the table
 cannot be read from inside Hive. Spark itself can read the table, but
 Hive throws a Serialization error because it doesn't know it is
 Parquet.

 val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
 df.saveAsTable("spark_test_foo")

 Expected:

 COLUMNS(
   education BIGINT,
   income BIGINT
 )

 SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
 InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

 Actual:

 COLUMNS(
   col array<string> COMMENT 'from deserializer'
 )

 SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
 InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

 ---

 Manually changing schema and storage restores access in Hive and
 doesn't affect Spark. Note also that Hive's table property
 spark.sql.sources.schema is correct. At first glance, it looks like
 the schema data is serialized when sent to Hive but not deserialized
 properly on receive.
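
For reference, a hedged sketch of the kind of manual repair described in the
paragraph above, assuming Hive 0.13 DDL; the statements are shown issued
through the HiveContext (whether they are passed through to Hive unchanged is
an assumption), and they could equally be run from the Hive CLI.

// Illustrative only: restore the real column list while the table still has
// a native SerDe, then switch the storage format so Hive reads it as Parquet.
sqlContext.sql("ALTER TABLE spark_test_foo REPLACE COLUMNS (education BIGINT, income BIGINT)")
sqlContext.sql("ALTER TABLE spark_test_foo SET FILEFORMAT PARQUET")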

 I'm tracing execution through source code... but before I get any
 deeper, can anyone reproduce this behavior?

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd





Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
Hi Yin,

Thanks for the clarification. My first reaction is that if this is the
intended behavior, it is a wasted opportunity. Why create a managed
table in Hive that cannot be read from inside Hive? I think I
understand now that you are essentially piggybacking on Hive's
metastore to persist table info between/across sessions, but I imagine
others might expect more (as I have.)

We find ourselves wanting to do work in Spark and persist the results
where other users (e.g. analysts using Tableau connected to
Hive/Impala) can explore them. I imagine this is very common. I can, of
course, save the data as Parquet and create an external table in Hive (which
I will do now), but saveAsTable seems much less useful to me now.
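
A minimal sketch of that workaround, assuming Spark 1.3 with a HiveContext;
the HDFS path and table name below are made up for illustration, and the DDL
could equally be run from the Hive CLI:

// Write the results as Parquet files from Spark.
df.saveAsParquetFile("/user/hive/warehouse/spark_test_ext")

// Point an external Hive table at those files so Hive/Impala (and Tableau
// through them) can query the results.
sqlContext.sql("""
  CREATE EXTERNAL TABLE spark_test_ext (education BIGINT, income BIGINT)
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/spark_test_ext'
""")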

Any other opinions?

Cheers,

C

On Thu, Mar 19, 2015 at 9:18 AM, Yin Huai yh...@databricks.com wrote:
 I meant table properties and serde properties are used to store metadata of
 a Spark SQL data source table. We do not set other fields like SerDe lib.
 For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table
 should not show unrelated stuff like Serde lib and InputFormat. I have
 created https://issues.apache.org/jira/browse/SPARK-6413 to track the
 improvement on the output of DESCRIBE statement.

 On Thu, Mar 19, 2015 at 12:11 PM, Yin Huai yh...@databricks.com wrote:

 Hi Christian,

 Your table is stored correctly in Parquet format.

 For saveAsTable, the table created is not a Hive table, but a Spark SQL
 data source table
 (http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources).
 We are only using Hive's metastore to store the metadata (to be specific,
 only table properties and serde properties). When you look at table
 property, there will be a field called spark.sql.sources.provider and the
 value will be org.apache.spark.sql.parquet.DefaultSource. You can also
 look at your files in the file system. They are stored by Parquet.

 Thanks,

 Yin

 On Thu, Mar 19, 2015 at 12:00 PM, Christian Perez christ...@svds.com
 wrote:

 Hi all,

 DataFrame.saveAsTable creates a managed table in Hive (v0.13 on
 CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong*
 schema _and_ storage format in the Hive metastore, so that the table
 cannot be read from inside Hive. Spark itself can read the table, but
 Hive throws a Serialization error because it doesn't know it is
 Parquet.

 val df = sc.parallelize( Array((1,2), (3,4)) ).toDF("education", "income")
 df.saveAsTable("spark_test_foo")

 Expected:

 COLUMNS(
   education BIGINT,
   income BIGINT
 )

 SerDe Library:
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
 InputFormat:
 org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

 Actual:

 COLUMNS(
   col array<string> COMMENT 'from deserializer'
 )

 SerDe Library: org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
 InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat

 ---

 Manually changing schema and storage restores access in Hive and
 doesn't affect Spark. Note also that Hive's table property
 spark.sql.sources.schema is correct. At first glance, it looks like
 the schema data is serialized when sent to Hive but not deserialized
 properly on receive.

 I'm tracing execution through source code... but before I get any
 deeper, can anyone reproduce this behavior?

 Cheers,

 Christian

 --
 Christian Perez
 Silicon Valley Data Science
 Data Analyst
 christ...@svds.com
 @cp_phd







-- 
Christian Perez
Silicon Valley Data Science
Data Analyst
christ...@svds.com
@cp_phd

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org