Since Hive doesn't support schema evolution, you'll have to update the
schema stored in the metastore yourself. For example, you can create a new
external table with the merged schema. Say you have a Hive table t1:
CREATE TABLE t1 (c0 INT, c1 DOUBLE);
By default, this table is stored under the HDFS path
hdfs://some-host:9000/user/hive/warehouse/t1. Now you append some
Parquet data with an extra column c2 to the same directory:
import org.apache.spark.sql.types._
import sqlContext.implicits._  // implicit in spark-shell; needed for the 'symbol column syntax

val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
val df1 = sqlContext.range(10)
  .select('id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
df1.write.mode("append").parquet(path)
Now you can create a new external table t2 like this:
val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
df2.write.format("parquet").option("path", path).saveAsTable("t2")
Since we specified a path above, the newly created t2 is an external
table pointing to the original HDFS location, but its schema is the
merged version.
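To double-check that the merged schema was actually persisted, you can look t2 up
through the metastore again. A minimal sketch, assuming the same spark-shell session
as above (where sqlContext is a HiveContext):

// t2 is resolved through the Hive metastore, so this shows the schema that was persisted
sqlContext.table("t2").printSchema()       // should list c0, c1 and the new column c2
sqlContext.table("t2").select('c2).show()  // the new column is queryable as well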
The drawback of this approach is that t2 is actually a Spark SQL
specific data source table rather than a genuine Hive table. This means
it can be accessed by Spark SQL only; we're just using the Hive metastore
to persist the metadata of the data source table. However, since you're
asking how to access the new table via the Spark SQL CLI, this should work
for you. We are working on making Parquet and ORC data source tables
accessible via Hive in Spark 1.5.0.
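For example, once t2 is registered you can query it by name from Spark SQL, and the
equivalent SELECT can be typed directly at the Spark SQL CLI prompt, since both go
through the same Hive metastore. A sketch, assuming the session above (the WHERE
clause is just an illustration):

// Data source tables are resolved by Spark SQL itself, not by Hive
sqlContext.sql("SELECT c0, c1, c2 FROM t2 WHERE c2 IS NOT NULL").show()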
Cheng
On 7/22/15 10:32 AM, Jerrick Hoang wrote:
Hi Lian,
Sorry, I'm new to Spark so I did not express myself very clearly. I'm
concerned about the situation where, say, I have a Parquet table with
some partitions, and I add a new column A to the Parquet schema and write
some data with the new schema to a new partition of the table. If I'm
not mistaken, if I do
sqlContext.read.parquet(table_path).printSchema() it will print the
correct schema with the new column A. But if I do a 'describe table' from
the Spark SQL CLI I won't see the new column being added. I understand that
this is because Hive doesn't support schema evolution. So what is the
best way to support CLI queries in this situation? Do I need to
manually alter the table every time the underlying schema changes?
Thanks
On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
Hey Jerrick,
What do you mean by "schema evolution with Hive metastore tables"?
Hive doesn't take schema evolution into account. Could you please
give a concrete use case? Are you trying to write Parquet data
with extra columns into an existing metastore Parquet table?
Cheng
On 7/21/15 1:04 AM, Jerrick Hoang wrote:
I'm new to Spark, any ideas would be much appreciated! Thanks
On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
Hi all,
I'm aware of the support for schema evolution via the DataFrame
API. Just wondering what would be the best way to go about
dealing with schema evolution for Hive metastore tables. So,
say I create a table via the SparkSQL CLI, how would I deal with
Parquet schema evolution?
Thanks,
J