Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
Yeah, the benefit of `saveAsTable` is that you don't need to deal with the 
schema explicitly, while the benefit of ALTER TABLE is that you still have a 
standard, vanilla Hive table.


Cheng

On 7/22/15 11:00 PM, Dean Wampler wrote:
While it's not recommended to overwrite files Hive thinks it 
understands, you can add the column to Hive's metastore with an ALTER 
TABLE command, issued as HiveQL in the Hive shell or through HiveContext.sql():


ALTER TABLE mytable ADD COLUMNS (col_name data_type)

See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column 
for full details.



Re: Spark-hive parquet schema evolution

2015-07-22 Thread Dean Wampler
While it's not recommended to overwrite files Hive thinks it understands,
you can add the column to Hive's metastore with an ALTER TABLE command,
issued as HiveQL in the Hive shell or through HiveContext.sql():

ALTER TABLE mytable ADD COLUMNS (col_name data_type)

See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
for full details.
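
For example, the same DDL can be issued from Spark itself. A minimal sketch,
assuming sqlContext is a HiveContext and that mytable / new_col are placeholder
names (not from this thread):

// Sketch only: evolve the metastore schema from Spark instead of the Hive shell.
// `mytable` and `new_col` are placeholders.
sqlContext.sql("ALTER TABLE mytable ADD COLUMNS (new_col STRING)")
sqlContext.sql("DESCRIBE mytable").show()   // the new column should now be listed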

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com



Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
Since Hive doesn’t support schema evolution, you’ll have to update the 
schema stored in the metastore somehow. For example, you can create a new 
external table with the merged schema. Say you have a Hive table t1:


CREATE TABLE t1 (c0 INT, c1 DOUBLE);

By default, this table is stored in the HDFS path 
hdfs://some-host:9000/user/hive/warehouse/t1. Now you append some 
Parquet data with an extra column c2 to the same directory:


import org.apache.spark.sql.types._

val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
val df1 = sqlContext.range(10).select(
  'id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
df1.write.mode("append").parquet(path)


Now you can create a new external table t2 like this:

val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
df2.write.path(path).saveAsTable("t2")


Since we specified a path above, the newly created t2 is an external 
table pointing to the original HDFS location. But the schema of t2 is 
the merged version.
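
A quick way to confirm this from the same session (a sketch, not part of the
original reply; it assumes the t2 created above):

// Verify that Spark SQL sees the merged schema for t2.
sqlContext.table("t2").printSchema()                        // should list c0, c1 and c2
sqlContext.sql("SELECT c0, c1, c2 FROM t2 LIMIT 5").show()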


The drawback of this approach is that t2 is actually a Spark SQL-specific 
data source table rather than a genuine Hive table. This means it can be 
accessed by Spark SQL only. We’re just using the Hive metastore to help 
persist the metadata of the data source table. However, since you’re 
asking how to access the new table via the Spark SQL CLI, this should work 
for you. We are working on making Parquet and ORC data source tables 
accessible via Hive in Spark 1.5.0.


Cheng



Re: Spark-hive parquet schema evolution

2015-07-21 Thread Jerrick Hoang
Hi Lian,

Sorry, I'm new to Spark, so I did not express myself very clearly. I'm
concerned about the following situation: let's say I have a Parquet table with
some partitions, and I add a new column A to the Parquet schema and write some
data with the new schema to a new partition in the table. If I'm not mistaken,
if I do sqlContext.read.parquet(table_path).printSchema() it will print the
correct schema with the new column A. But if I do a 'describe table' from the
Spark SQL CLI I won't see the new column. I understand that this is because
Hive doesn't support schema evolution. So what is the best way to support CLI
queries in this situation? Do I need to manually alter the table every time
the underlying schema changes?
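
Concretely, something like this (a sketch; table_path and my_table are
placeholders):

// Sketch of the mismatch described above; `table_path` and `my_table` are placeholders.
sqlContext.read.parquet(table_path).printSchema()   // shows the new column A
// but from the Spark SQL CLI:
//   DESCRIBE my_table;   -- the new column A is not listed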

Thanks



Re: Spark-hive parquet schema evolution

2015-07-21 Thread Cheng Lian

Hey Jerrick,

What do you mean by "schema evolution with Hive metastore tables"? Hive 
doesn't take schema evolution into account. Could you please give a 
concrete use case? Are you trying to write Parquet data with extra 
columns into an existing metastore Parquet table?


Cheng



Re: Spark-hive parquet schema evolution

2015-07-20 Thread Jerrick Hoang
I'm new to Spark, any ideas would be much appreciated! Thanks



Spark-hive parquet schema evolution

2015-07-18 Thread Jerrick Hoang
Hi all,

I'm aware of the support for schema evolution via the DataFrame API. Just
wondering what would be the best way to go about dealing with schema
evolution with Hive metastore tables. So, say I create a table via the Spark
SQL CLI, how would I deal with Parquet schema evolution?
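
(By schema evolution via the DataFrame API I mean roughly the following
sketch; the path is a placeholder and sqlContext is assumed to be a
SQLContext:)

// Sketch: Parquet schema merging through the DataFrame API.
val df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/table")
df.printSchema()   // union of the schemas found across the Parquet part-files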

Thanks,
J