Since Hive doesn't support schema evolution, you'll have to update the
schema stored in the metastore yourself. For example, you can create a new
external table with the merged schema. Say you have a Hive table t1:
CREATE TABLE t1 (c0 INT, c1 DOUBLE);
By default, this table is stored under the HDFS path
hdfs://some-host:9000/user/hive/warehouse/t1. Now you append some
Parquet data with an extra column c2 to the same directory:
import org.apache.spark.sql.types._
import sqlContext.implicits._  // implicit in spark-shell; needed for the 'symbol column syntax

val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
val df1 = sqlContext.range(10)
  .select('id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
df1.write.mode("append").parquet(path)
Now you can create a new external table t2 like this:
val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
df2.write.format("parquet").option("path", path).saveAsTable("t2")
Since we specified a path above, the newly created t2 is an external
table pointing to the original HDFS location, but its schema is the
merged version.
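To double-check that the merged schema was actually persisted, you can look t2 up
through the metastore again. A minimal sketch, assuming the same spark-shell session
as above (where sqlContext is a HiveContext):

// t2 is resolved through the Hive metastore, so this shows the schema that was persisted
sqlContext.table("t2").printSchema()       // should list c0, c1 and the new column c2
sqlContext.table("t2").select('c2).show()  // the new column is queryable as well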
The drawback of this approach is that t2 is actually a Spark SQL
specific data source table rather than a genuine Hive table. This means
it can be accessed by Spark SQL only; we're just using the Hive metastore
to persist the metadata of the data source table. However, since you're
asking how to access the new table via the Spark SQL CLI, this should work
for you. We are working on making Parquet and ORC data source tables
accessible via Hive in Spark 1.5.0.
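For example, once t2 is registered you can query it by name from Spark SQL, and the
equivalent SELECT can be typed directly at the Spark SQL CLI prompt, since both go
through the same Hive metastore. A sketch, assuming the session above (the WHERE
clause is just an illustration):

// Data source tables are resolved by Spark SQL itself, not by Hive
sqlContext.sql("SELECT c0, c1, c2 FROM t2 WHERE c2 IS NOT NULL").show()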
Cheng
On 7/22/15 10:32 AM, Jerrick Hoang wrote:
Hi Lian,
Sorry, I'm new to Spark so I did not express myself very clearly. I'm
concerned about the situation where, say, I have a Parquet table with
some partitions, and I add a new column A to the Parquet schema and write
some data with the new schema to a new partition of the table. If I'm
not mistaken, if I do
sqlContext.read.parquet(table_path).printSchema() it will print the
correct schema with the new column A. But if I do a 'describe table' from
the Spark SQL CLI I won't see the new column being added. I understand that
this is because Hive doesn't support schema evolution. So what is the
best way to support CLI queries in this situation? Do I need to
manually alter the table every time the underlying schema changes?
Thanks
On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
Hey Jerrick,
What do you mean by "schema evolution with Hive metastore tables"?
Hive doesn't take schema evolution into account. Could you please
give a concrete use case? Are you trying to write Parquet data
with extra columns into an existing metastore Parquet table?
Cheng
On 7/21/15 1:04 AM, Jerrick Hoang wrote:
I'm new to Spark, any ideas would be much appreciated! Thanks
On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
Hi all,
I'm aware of the support for schema evolution via the DataFrame
API. Just wondering what would be the best way to go about
dealing with schema evolution for Hive metastore tables. So,
say I create a table via the SparkSQL CLI, how would I deal with
Parquet schema evolution?
Thanks,
J