Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Ryan Blue Tue, 15 Jan 2019 10:24:22 -0800

Xiao, thanks for clarifying.

There are a few use cases for metastore tables. Felix mentions a good one,
custom metastore tables. There are also common formats that Spark doesn't
support natively. Spark has CSV support, but the behavior is different from
Hive's delimited format. Hive also supports Sequence file tables. We have a
few of those that are old.


The other use case that comes to mind is mixed-format tables. I don't think
Spark supports a different format per partition without going through the
Hive read path. We use this feature to convert old tables to Parquet by
simply writing new partitions in Parquet format. Without this, it would be
a much more painful migration process. Only the jobs that read older
partitions need to go through Hive, so converting to a HadoopFs table
usually works.

rb

On Tue, Jan 15, 2019 at 10:07 AM Felix Cheung <[email protected]>
wrote:

> One common case we have is a custom input format.
>
> In any case, even when Hive metatstore is protocol compatible we should
> still upgrade or replace the hive jar from a fork, as Sean says, from a ASF
> release process standpoint. Unless there is a plan for removing hive
> integration (all of it) from the spark core project..
>
>
> ------------------------------
> *From:* Xiao Li <[email protected]>
> *Sent:* Tuesday, January 15, 2019 10:03 AM
> *To:* Felix Cheung
> *Cc:* [email protected]; Yuming Wang; dev
> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>
> Let me take my words back. To read/write a table, Spark users do not use
> the Hive execution JARs, unless they explicitly create the Hive serde
> tables. Actually, I want to understand the motivation and use cases why
> your usage scenarios need to create Hive serde tables instead of our Spark
> native tables?
>
> BTW, we are still using Hive metastore as our metadata store. This does
> not require the Hive execution JAR upgrade, based on my understanding.
> Users can upgrade it to the newer version of Hive metastore.
>
> Felix Cheung <[email protected]> 于2019年1月15日周二 上午9:56写道：
>
>> And we are super 100% dependent on Hive...
>>
>>
>> ------------------------------
>> *From:* Ryan Blue <[email protected]>
>> *Sent:* Tuesday, January 15, 2019 9:53 AM
>> *To:* Xiao Li
>> *Cc:* Yuming Wang; dev
>> *Subject:* Re: [DISCUSS] Upgrade built-in Hive to 2.3.4
>>
>> How do we know that most Spark users are not using Hive? I wouldn't be
>> surprised either way, but I do want to make sure we aren't making decisions
>> based on any one person's (or one company's) experience about what "most"
>> Spark users do.
>>
>> On Tue, Jan 15, 2019 at 9:44 AM Xiao Li <[email protected]> wrote:
>>
>>> Hi, Yuming,
>>>
>>> Thank you for your contributions! The community aims at reducing the
>>> dependence on Hive. Currently, most of Spark users are not using Hive. The
>>> changes looks risky to me.
>>>
>>> To support Hadoop 3.x, we just need to resolve this JIRA:
>>> https://issues.apache.org/jira/browse/HIVE-16391
>>>
>>> Cheers,
>>>
>>> Xiao
>>>
>>> Yuming Wang <[email protected]> 于2019年1月15日周二 上午8:41写道：
>>>
>>>> Dear Spark Developers and Users,
>>>>
>>>>
>>>>
>>>> Hyukjin and I plan to upgrade the built-in Hive from1.2.1-spark2
>>>> <https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> to2.3.4
>>>> <https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to
>>>> solve some critical issues, such as support Hadoop 3.x, solve some ORC and
>>>> Parquet issues. This is the list:
>>>>
>>>> *Hive issues*:
>>>>
>>>> [SPARK-26332 
>>>> <https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790]
>>>> Spark sql write orc table on viewFS throws exception
>>>>
>>>> [SPARK-25193 
>>>> <https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505]
>>>> insert overwrite doesn't throw exception when drop old data fails
>>>>
>>>> [SPARK-26437 
>>>> <https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083]
>>>> Decimal data becomes bigint to query, unable to query
>>>>
>>>> [SPARK-25919 
>>>> <https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771]
>>>> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target
>>>> table is Partitioned
>>>>
>>>> [SPARK-12014 
>>>> <https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100]
>>>> Spark SQL query containing semicolon is broken in Beeline
>>>>
>>>>
>>>>
>>>> *Spark issues*:
>>>>
>>>> [SPARK-23534 <https://issues.apache.org/jira/browse/SPARK-23534>]
>>>> Spark run on Hadoop 3.0.0
>>>>
>>>> [SPARK-20202 <https://issues.apache.org/jira/browse/SPARK-20202>]
>>>> Remove references to org.spark-project.hive
>>>>
>>>> [SPARK-18673 <https://issues.apache.org/jira/browse/SPARK-18673>]
>>>> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
>>>>
>>>> [SPARK-24766 <https://issues.apache.org/jira/browse/SPARK-24766>]
>>>> CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column
>>>> stats in parquet
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Since the code for the *hive-thriftserver* module has changed too much
>>>> for this upgrade, I split it into two PRs for easy review.
>>>>
>>>> The first PR <https://github.com/apache/spark/pull/23552> does not
>>>> contain the changes of hive-thriftserver. Please ignore the failed test in
>>>> hive-thriftserver.
>>>>
>>>> The second PR <https://github.com/apache/spark/pull/23553> is complete
>>>> changes.
>>>>
>>>>
>>>>
>>>> I have created a Spark distribution for Apache Hadoop 2.7, you might
>>>> download it viaGoogle Drive
>>>> <https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> 
>>>> orBaidu
>>>> Pan <https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
>>>>
>>>> Please help review and test. Thanks.
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Reply via email to