While it's not recommended to overwrite files Hive thinks it understands,
you can add the column to Hive's metastore with an ALTER TABLE command,
either in HiveQL from the Hive shell or through HiveContext.sql():

ALTER TABLE mytable ADD COLUMNS (col_name data_type)
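
For example, a minimal sketch from the spark-shell (the table and column
names here are hypothetical, and it assumes a shell built with Hive support):

import org.apache.spark.sql.hive.HiveContext

// Hypothetical example: add a STRING column c2 to an existing Hive table.
val hiveContext = new HiveContext(sc)
hiveContext.sql("ALTER TABLE mytable ADD COLUMNS (c2 STRING)")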

See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
for full details.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

>  Since Hive doesn’t support schema evolution, you’ll have to update the
> schema stored in the metastore somehow. For example, you can create a new
> external table with the merged schema. Say you have a Hive table t1:
>
> CREATE TABLE t1 (c0 INT, c1 DOUBLE);
>
> By default, this table is stored in HDFS path
> hdfs://some-host:9000/user/hive/warehouse/t1. Now you append some Parquet
> data with an extra column c2 to the same directory:
>
> import org.apache.spark.sql.types._
> val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
> val df1 = sqlContext.range(10).select('id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
> df1.write.mode("append").parquet(path)
>
> Now you can create a new external table t2 like this:
>
> val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
> df2.write.option("path", path).saveAsTable("t2")
>
> Since we specified a path above, the newly created t2 is an external
> table pointing to the original HDFS location. But the schema of t2 is the
> merged version.
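>
> To double-check (a minimal sketch, assuming the df2 and t2 created above),
> you can compare the merged schema against what the metastore-backed table
> now reports:
>
> df2.printSchema()                    // shows c0, c1 and the new c2
> sqlContext.table("t2").printSchema() // should show the same merged schema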
>
> The drawback of this approach is that t2 is actually a Spark SQL-specific
> data source table rather than a genuine Hive table. This means it can be
> accessed only by Spark SQL. We’re just using the Hive metastore to help
> persist the metadata of the data source table. However, since you’re asking
> how to access the new table via the Spark SQL CLI, this should work for you.
> We are working on making Parquet and ORC data source tables accessible via
> Hive in Spark 1.5.0.
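>
> As a quick sanity check (a sketch, assuming the t2 above and a
> HiveContext-backed sqlContext), the table can be queried like any other
> registered table, and the same SELECT also works from the Spark SQL CLI:
>
> sqlContext.sql("SELECT c0, c1, c2 FROM t2").show()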
>
> Cheng
>
> On 7/22/15 10:32 AM, Jerrick Hoang wrote:
>
>   Hi Lian,
>
>  Sorry, I'm new to Spark so I did not express myself very clearly. I'm
> concerned about the situation where, say, I have a Parquet table with some
> partitions, I add a new column A to the Parquet schema, and I write some data
> with the new schema to a new partition in the table. If I'm not mistaken,
> if I do a sqlContext.read.parquet(table_path).printSchema() it will print
> the correct schema with the new column A. But if I do a 'describe table' from
> the Spark SQL CLI I won't see the new column being added. I understand that
> this is because Hive doesn't support schema evolution. So what is the best
> way to support CLI queries in this situation? Do I need to manually alter the
> table every time the underlying schema changes?
>
>  Thanks
>
> On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
>>  Hey Jerrick,
>>
>> What do you mean by "schema evolution with Hive metastore tables"? Hive
>> doesn't take schema evolution into account. Could you please give a
>> concrete use case? Are you trying to write Parquet data with extra columns
>> into an existing metastore Parquet table?
>>
>> Cheng
>>
>>
>> On 7/21/15 1:04 AM, Jerrick Hoang wrote:
>>
>> I'm new to Spark; any ideas would be much appreciated! Thanks
>>
>> On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <jerrickho...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>>  I'm aware of the support for schema evolution via the DataFrame API. Just
>>> wondering what would be the best way to go about dealing with schema
>>> evolution with Hive metastore tables. So, say I create a table via the
>>> Spark SQL CLI, how would I deal with Parquet schema evolution?
>>>
>>>  Thanks,
>>> J
>>>
>>
>>
>>
>
>
