Since Hive doesn't support schema evolution, you'll have to update the schema stored in the metastore yourself. For example, you can create a new external table with the merged schema. Say you have a Hive table |t1|:

    CREATE TABLE t1 (c0 INT, c1 DOUBLE);

By default, this table is stored under the HDFS path |hdfs://some-host:9000/user/hive/warehouse/t1|. Now you append some Parquet data with an extra column |c2| to the same directory:

    import org.apache.spark.sql.types._
    import sqlContext.implicits._  // only needed outside spark-shell, for the 'symbol column syntax

    val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
    val df1 = sqlContext.range(10).select(
      'id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
    df1.write.mode("append").parquet(path)

Now you can create a new external table |t2| like this:

    // Read back the files with schema merging, then save the merged schema as a new table.
    // Specifying the path makes t2 point at the original data instead of copying it.
    val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
    df2.write.format("parquet").option("path", path).saveAsTable("t2")

Since we specified a path above, the newly created |t2| is an external table pointing to the original HDFS location, but its schema is the merged version.
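
To double-check, something like the following should print the merged schema, including |c2|, once |t2| is loaded back through the metastore:

    // Load t2 via the metastore; c0, c1 and the new c2 should all show up.
    sqlContext.table("t2").printSchema()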

The drawback of this approach is that |t2| is actually a Spark SQL specific data source table rather than a genuine Hive table, which means it can only be accessed through Spark SQL. We're just using the Hive metastore to help persist the metadata of the data source table. However, since you're asking how to access the new table via the Spark SQL CLI, this should work for you. We are working on making Parquet and ORC data source tables accessible from Hive in Spark 1.5.0.
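
For example, a query like this (just a sketch reusing the |t2| created above) should see all three columns, and the equivalent SELECT can be issued directly at the spark-sql prompt as well:

    // Query t2 with SQL; the same statement also works from the Spark SQL CLI.
    sqlContext.sql("SELECT c0, c1, c2 FROM t2 LIMIT 10").show()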

Cheng

On 7/22/15 10:32 AM, Jerrick Hoang wrote:

Hi Lian,

Sorry, I'm new to Spark so I did not express myself very clearly. I'm concerned about the situation where, say, I have a Parquet table with some partitions, I add a new column A to the Parquet schema, and I write some data with the new schema to a new partition of the table. If I'm not mistaken, sqlContext.read.parquet(table_path).printSchema() will print the correct schema with the new column A, but if I do a 'describe table' from the Spark SQL CLI I won't see the new column. I understand that this is because Hive doesn't support schema evolution. So what is the best way to support CLI queries in this situation? Do I need to manually alter the table every time the underlying schema changes?
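
Concretely, this is roughly the situation (the path and table name below are just placeholders):

    val table_path = "hdfs://some-host:9000/user/hive/warehouse/my_table"  // placeholder

    // Reading the Parquet files directly picks up the new column A ...
    sqlContext.read.option("mergeSchema", "true").parquet(table_path).printSchema()
    // ... but the schema stored in the Hive metastore is unchanged, so
    // 'DESCRIBE my_table' from the Spark SQL CLI still shows only the old columns.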

Thanks

On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

    Hey Jerrick,

    What do you mean by "schema evolution with Hive metastore tables"?
    Hive doesn't take schema evolution into account. Could you please
    give a concrete use case? Are you trying to write Parquet data
    with extra columns into an existing metastore Parquet table?

    Cheng


    On 7/21/15 1:04 AM, Jerrick Hoang wrote:
    I'm new to Spark, any ideas would be much appreciated! Thanks

    On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang
    <jerrickho...@gmail.com> wrote:

        Hi all,

        I'm aware of the support for schema evolution via DataFrame
        API. Just wondering what would be the best way to go about
        dealing with schema evolution with Hive metastore tables. So,
        say I create a table via SparkSQL CLI, how would I deal with
        Parquet schema evolution?

        Thanks,
        J



