Yeah, the benefit of `saveAsTable` is that you don't need to deal with the schema explicitly, while the benefit of ALTER TABLE is that you still have a standard vanilla Hive table.

Cheng

On 7/22/15 11:00 PM, Dean Wampler wrote:
While it's not recommended to overwrite files Hive thinks it understands, you can add the column to Hive's metastore with an ALTER TABLE command, issued as HiveQL either in the Hive shell or through HiveContext.sql():

ALTER TABLE mytable ADD COLUMNS (col_name data_type);

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column for full details.
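
If you'd prefer to issue the statement from Spark instead of the Hive shell, something like this should work (just a sketch; the table and column names are placeholders):

import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext named sc; mytable and c2 are placeholders.
val hiveContext = new HiveContext(sc)
hiveContext.sql("ALTER TABLE mytable ADD COLUMNS (c2 STRING)")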

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Jul 22, 2015 at 4:36 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

    Since Hive doesn’t support schema evolution, you’ll have to update
    the schema stored in the metastore somehow. For example, you can
    create a new external table with the merged schema. Say you have a
    Hive table t1:

        CREATE TABLE t1 (c0 INT, c1 DOUBLE);

    By default, this table is stored in the HDFS path
    hdfs://some-host:9000/user/hive/warehouse/t1. Now you append
    some Parquet data with an extra column c2 to the same directory:

        import org.apache.spark.sql.types._

        val path = "hdfs://some-host:9000/user/hive/warehouse/t1"
        val df1 = sqlContext.range(10).select(
          'id as 'c0,
          'id cast DoubleType as 'c1,
          'id cast StringType as 'c2)
        df1.write.mode("append").parquet(path)

    Now you can create a new external table t2 like this:

        val df2 = sqlContext.read.option("mergeSchema", "true").parquet(path)
        df2.write.format("parquet").option("path", path).saveAsTable("t2")

    Since we specified a path above, the newly created t2 is an
    external table pointing to the original HDFS location. But the
    schema of t2 is the merged version.

    The drawback of this approach is that t2 is actually a Spark
    SQL specific data source table rather than a genuine Hive table,
    which means it can be accessed by Spark SQL only. We’re just using
    the Hive metastore to persist the metadata of the data source
    table. However, since you’re asking how to access the new table
    via the Spark SQL CLI, this should work for you. We are working on
    making Parquet and ORC data source tables accessible via Hive in
    Spark 1.5.0.
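
    To illustrate, once t2 exists you can check the merged schema from
    the Spark SQL side with something like this (just a quick sketch):

        // From a Spark shell with a HiveContext-backed sqlContext:
        sqlContext.table("t2").printSchema()              // merged schema, including c2
        sqlContext.sql("SELECT c0, c1, c2 FROM t2").show()

    Queries against t2 from the Spark SQL CLI should behave the same
    way, since the CLI also goes through a HiveContext.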

    Cheng

    On 7/22/15 10:32 AM, Jerrick Hoang wrote:

    Hi Lian,

    Sorry, I'm new to Spark, so I did not express myself very clearly.
    I'm concerned about the situation where, say, I have a Parquet
    table with some partitions, and I add a new column A to the Parquet
    schema and write some data with the new schema to a new partition
    in the table. If I'm not mistaken, if I do
    sqlContext.read.parquet(table_path).printSchema() it will print
    the correct schema with the new column A. But if I do a 'describe
    table' from the Spark SQL CLI I won't see the new column being
    added. I understand that this is because Hive doesn't support
    schema evolution. So what is the best way to support CLI queries
    in this situation? Do I need to manually alter the table every
    time the underlying schema changes?
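
    To make the scenario concrete, here is roughly what I mean. This is
    just a sketch; the path, partition directory, and column names
    below are made up:

        import org.apache.spark.sql.types._
        import sqlContext.implicits._

        // The existing table has columns (c0 INT, c1 DOUBLE); write a new
        // partition whose Parquet files carry an extra column A.
        val table_path = "hdfs://some-host:9000/user/hive/warehouse/mytable"
        val newPart = sqlContext.range(10).select(
          'id cast IntegerType as 'c0,
          'id cast DoubleType as 'c1,
          'id cast StringType as 'A)
        newPart.write.mode("append").parquet(table_path + "/ds=2015-07-22")

        // This picks up column A (with Parquet schema merging enabled) ...
        sqlContext.read.parquet(table_path).printSchema()
        // ... but 'describe table' from the Spark SQL CLI still shows only
        // the schema recorded in the Hive metastore.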

    Thanks

    On Tue, Jul 21, 2015 at 4:37 PM, Cheng Lian
    <lian.cs....@gmail.com> wrote:

        Hey Jerrick,

        What do you mean by "schema evolution with Hive metastore
        tables"? Hive doesn't take schema evolution into account.
        Could you please give a concrete use case? Are you trying to
        write Parquet data with extra columns into an existing
        metastore Parquet table?

        Cheng


        On 7/21/15 1:04 AM, Jerrick Hoang wrote:
        I'm new to Spark; any ideas would be much appreciated! Thanks

        On Sat, Jul 18, 2015 at 11:11 AM, Jerrick Hoang
        <jerrickho...@gmail.com> wrote:

            Hi all,

            I'm aware of the support for schema evolution via the
            DataFrame API. I'm just wondering what would be the best
            way to go about dealing with schema evolution with Hive
            metastore tables. Say I create a table via the Spark SQL
            CLI, how would I deal with Parquet schema evolution?

            Thanks,
            J




