Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Felix Cheung Tue, 15 Jan 2019 10:07:37 -0800

One common case we have is a custom input format.

In any case, even when Hive metatstore is protocol compatible we should still 
upgrade or replace the hive jar from a fork, as Sean says, from a ASF release 
process standpoint. Unless there is a plan for removing hive integration (all 
of it) from the spark core project..

________________________________
From: Xiao Li <gatorsm...@gmail.com>
Sent: Tuesday, January 15, 2019 10:03 AM
To: Felix Cheung
Cc: rb...@netflix.com; Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Let me take my words back. To read/write a table, Spark users do not use the 
Hive execution JARs, unless they explicitly create the Hive serde tables. 
Actually, I want to understand the motivation and use cases why your usage 
scenarios need to create Hive serde tables instead of our Spark native tables?

BTW, we are still using Hive metastore as our metadata store. This does not 
require the Hive execution JAR upgrade, based on my understanding. Users can 
upgrade it to the newer version of Hive metastore.

Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> 
于2019年1月15日周二 上午9:56写道：
And we are super 100% dependent on Hive...

________________________________
From: Ryan Blue <rb...@netflix.com.invalid>
Sent: Tuesday, January 15, 2019 9:53 AM
To: Xiao Li
Cc: Yuming Wang; dev
Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

How do we know that most Spark users are not using Hive? I wouldn't be 
surprised either way, but I do want to make sure we aren't making decisions 
based on any one person's (or one company's) experience about what "most" Spark 
users do.

On Tue, Jan 15, 2019 at 9:44 AM Xiao Li 
<gatorsm...@gmail.com<mailto:gatorsm...@gmail.com>> wrote:
Hi, Yuming,

Thank you for your contributions! The community aims at reducing the dependence 
on Hive. Currently, most of Spark users are not using Hive. The changes looks 
risky to me.

To support Hadoop 3.x, we just need to resolve this JIRA: 
https://issues.apache.org/jira/browse/HIVE-16391

Cheers,

Xiao

Yuming Wang <wgy...@gmail.com<mailto:wgy...@gmail.com>> 于2019年1月15日周二 上午8:41写道：
Dear Spark Developers and Users,

Hyukjin and I plan to upgrade the built-in Hive 
from1.2.1-spark2<https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2> 
to2.3.4<https://github.com/apache/hive/releases/tag/rel%2Frelease-2.3.4> to 
solve some critical issues, such as support Hadoop 3.x, solve some ORC and 
Parquet issues. This is the list:
Hive issues:
[SPARK-26332<https://issues.apache.org/jira/browse/SPARK-26332>][HIVE-10790] 
Spark sql write orc table on viewFS throws exception
[SPARK-25193<https://issues.apache.org/jira/browse/SPARK-25193>][HIVE-12505] 
insert overwrite doesn't throw exception when drop old data fails
[SPARK-26437<https://issues.apache.org/jira/browse/SPARK-26437>][HIVE-13083] 
Decimal data becomes bigint to query, unable to query
[SPARK-25919<https://issues.apache.org/jira/browse/SPARK-25919>][HIVE-11771] 
Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
table is Partitioned
[SPARK-12014<https://issues.apache.org/jira/browse/SPARK-12014>][HIVE-11100] 
Spark SQL query containing semicolon is broken in Beeline

Spark issues:
[SPARK-23534<https://issues.apache.org/jira/browse/SPARK-23534>] Spark run on 
Hadoop 3.0.0
[SPARK-20202<https://issues.apache.org/jira/browse/SPARK-20202>] Remove 
references to org.spark-project.hive
[SPARK-18673<https://issues.apache.org/jira/browse/SPARK-18673>] Dataframes 
doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[SPARK-24766<https://issues.apache.org/jira/browse/SPARK-24766>] 
CreateHiveTableAsSelect and InsertIntoHiveDir won't generate decimal column 
stats in parquet

Since the code for the hive-thriftserver module has changed too much for this 
upgrade, I split it into two PRs for easy review.
The first PR<https://github.com/apache/spark/pull/23552> does not contain the 
changes of hive-thriftserver. Please ignore the failed test in 
hive-thriftserver.
The second PR<https://github.com/apache/spark/pull/23553> is complete changes.

I have created a Spark distribution for Apache Hadoop 2.7, you might download 
it viaGoogle 
Drive<https://drive.google.com/open?id=1cq2I8hUTs9F4JkFyvRfdOJ5BlxV0ujgt> 
orBaidu Pan<https://pan.baidu.com/s/1b090Ctuyf1CDYS7c0puBqQ>.
Please help review and test. Thanks.

--
Ryan Blue
Software Engineer
Netflix

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

Reply via email to