Yeah, the benefit of `saveAsTable` is that you don't need to deal with the
schema explicitly, while the benefit of ALTER TABLE is that you still have a
standard vanilla Hive table.
Cheng
On 7/22/15 11:00 PM, Dean Wampler wrote:
While it's not recommended to overwrite files Hive thinks it
understands, you can add the column to Hive's metastore with an ALTER
TABLE command, issued as HiveQL in the Hive shell or through HiveContext.sql():

    ALTER TABLE mytable ADD COLUMNS (col_name data_type)
See
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column
for full details.
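For example (a rough sketch, assuming a HiveContext named sqlContext and a
hypothetical Parquet-backed table mytable gaining a string column c2):

    // Register the new column in the Hive metastore so that plain Hive
    // queries and DESCRIBE statements can see it.
    sqlContext.sql("ALTER TABLE mytable ADD COLUMNS (c2 STRING)")

    // The metastore should now report the extra column.
    sqlContext.sql("DESCRIBE mytable").show()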
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
Since Hive doesn't support schema evolution, you'll have to update
the schema stored in the metastore somehow. For example, you can
create a new external table with the merged schema. Say you have a
Hive table t1:

    CREATE TABLE t1 (c0 INT, c1 DOUBLE);

By default, this table is stored in the HDFS path
hdfs://some-host:9000/user/hive/warehouse/t1. Now you append
some Parquet data with an extra column c2 to the same directory:
    import org.apache.spark.sql.types._
    import sqlContext.implicits._

    val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
    // id fills all three columns; c2 is the extra column.
    val df1 = sqlContext.range(10).select(
      'id as 'c0, 'id cast DoubleType as 'c1, 'id cast StringType as 'c2)
    df1.write.mode("append").parquet(path)
Now you can create a new external table t2 like this:

    val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
    df2.write.option("path", path).saveAsTable("t2")
Since we specified a path above, the newly created t2 is an
external table pointing to the original HDFS location. But the
schema of t2 is the merged version.
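(As a quick check, a sketch assuming the t2 created above: the merged
schema should now be visible when going through the metastore entry as
well.)

    // Read the table back through its metastore entry rather than the raw
    // path; the printed schema should list c0, c1 and the extra column c2.
    sqlContext.table("t2").printSchema()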
The drawback of this approach is that t2 is actually a Spark
SQL specific data source table rather than a genuine Hive table.
This means it can be accessed by Spark SQL only; we're just using
the Hive metastore to persist the metadata of the data source
table. However, since you're asking how to access the new table
via the Spark SQL CLI, this should work for you. We are working on
making Parquet and ORC data source tables accessible via Hive in
Spark 1.5.0.
Cheng
On 7/22/15 10:32 AM, Jerrick Hoang wrote:
Hi Lian,
Sorry, I'm new to Spark so I did not express myself very clearly.
I'm concerned about the situation where, let's say, I have a Parquet
table with some partitions, and I add a new column A to the Parquet
schema and write some data with the new schema to a new partition in
the table. If I'm not mistaken, if I do
sqlContext.read.parquet(table_path).printSchema() it will print
the correct schema with the new column A. But if I do a 'describe
table' from the SparkSQL CLI I won't see the new column being added. I
understand that this is because Hive doesn't support schema
evolution. So what is the best way to support CLI queries in this
situation? Do I need to manually alter the table every time the
underlying schema changes?
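(For concreteness, a minimal sketch of the scenario above, with a
hypothetical table path and column names, assuming a HiveContext named
sqlContext:)

    import org.apache.spark.sql.types._
    import sqlContext.implicits._

    // Hypothetical location of the existing partitioned Parquet table.
    val tablePath = "hdfs://some-host:9000/user/hive/warehouse/mytable"

    // Write a new partition whose files carry the extra column A.
    val newData = sqlContext.range(5).select('id as 'b, 'id cast StringType as 'A)
    newData.write.mode("append").parquet(tablePath + "/part=new")

    // Reading the files directly picks up column A...
    sqlContext.read.parquet(tablePath).printSchema()

    // ...but 'DESCRIBE mytable' from the SparkSQL CLI still shows only the
    // columns recorded in the metastore.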
Thanks
On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
Hey Jerrick,
What do you mean by "schema evolution with Hive metastore
tables"? Hive doesn't take schema evolution into account.
Could you please give a concrete use case? Are you trying to
write Parquet data with extra columns into an existing
metastore Parquet table?
Cheng
On 7/21/15 1:04 AM, Jerrick Hoang wrote:
I'm new to Spark, any ideas would be much appreciated! Thanks
On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang <jerrickho...@gmail.com> wrote:
Hi all,
I'm aware of the support for schema evolution via the
DataFrame API. Just wondering what would be the best way
to go about dealing with schema evolution with Hive
metastore tables. So, say I create a table via the SparkSQL
CLI, how would I deal with Parquet schema evolution?
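(The DataFrame-level schema merging mentioned above looks roughly like
this; a sketch assuming Parquet part-files with differing schemas under
one hypothetical path:)

    // Ask the Parquet source to union the schemas of all part-files.
    val merged = sqlContext.read.option("mergeSchema", "true")
      .parquet("hdfs://some-host:9000/data/mytable")
    merged.printSchema()  // shows the union of columns across the files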
Thanks,
J