Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Marcelo Vanzin
+1 to that. HIVE-16391 by itself means we're giving up things like Hadoop 3, and we're also putting the burden on the Hive folks to fix a problem that we created. The current PR is basically a Spark-side fix for that bug. It does mean also upgrading Hive (which gives us Hadoop 3, yay!), but I

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Hyukjin Kwon
Resolving HIVE-16391 means Hive would have to release a 1.2.x version that contains the fixes from our Hive fork (correct me if I am mistaken). To be honest, and as a personal opinion, that basically asks Hive to take care of Spark's dependency. Hive looks to be moving ahead with 3.1.x, and no one would use the

Re: [ANNOUNCE] Announcing Apache Spark 2.2.3

2019-01-15 Thread Jeff Zhang
Congrats, great work, Dongjoon. Dongjoon Hyun wrote on Tue, Jan 15, 2019 at 3:47 PM: > We are happy to announce the availability of Spark 2.2.3! > > Apache Spark 2.2.3 is a maintenance release, based on the branch-2.2 > maintenance branch of Spark. We strongly recommend all 2.2.x users to > upgrade to this

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
It's almost certainly needed just to get off the fork of Hive we're not supposed to have. Yes it's going to impact dependencies, so would need to happen at Spark 3. Separately, its usage could be reduced or removed -- this I don't know much about. But it doesn't really make it harder or easier.

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
If https://issues.apache.org/jira/browse/HIVE-16391 can be resolved, we do not need to keep our fork of Hive. Sean Owen wrote on Tue, Jan 15, 2019 at 10:44 AM: > It's almost certainly needed just to get off the fork of Hive we're > not supposed to have. Yes it's going to impact dependencies, so would > need

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Since Spark 2.0, we have been trying to move all the Hive-specific logic to a separate package and make Hive a data source like the other built-in data sources. You might see a lot of refactoring PRs for this goal. Hive will still be an important data source that Spark supports, for sure. Now, the

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
Xiao, thanks for clarifying. There are a few use cases for metastore tables. Felix mentions a good one, custom metastore tables. There are also common formats that Spark doesn't support natively. Spark has CSV support, but the behavior is different from Hive's delimited format. Hive also supports

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Marcelo Vanzin
The metastore interactions in Spark are currently based on APIs that are in the Hive exec jar, which makes it impossible for Spark to work with Hadoop 3 until the exec jar is upgraded. It could be possible to re-implement those interactions based solely on the metastore client Hive

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
One common case we have is a custom input format. In any case, even when the Hive metastore is protocol compatible, we should still upgrade or replace the Hive jar from a fork, as Sean says, from an ASF release process standpoint. Unless there is a plan for removing Hive integration (all of it)

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Let me take back my words. To read/write a table, Spark users do not use the Hive execution JARs unless they explicitly create Hive serde tables. Actually, I want to understand the motivation and the use cases: why do your usage scenarios need to create Hive serde tables instead of our Spark native

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Sean Owen
Unless it's going away entirely, and I don't think it is, we at least have to do this to get off the fork of Hive that's being used now. I do think we want to keep Hive from getting into the core though -- see comments on PR. On Tue, Jan 15, 2019 at 11:44 AM Xiao Li wrote: > > Hi, Yuming, > >

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
And we are super 100% dependent on Hive... From: Ryan Blue Sent: Tuesday, January 15, 2019 9:53 AM To: Xiao Li Cc: Yuming Wang; dev Subject: Re: [DISCUSS] Upgrade built-in Hive to 2.3.4 How do we know that most Spark users are not using Hive? I wouldn't be

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Felix Cheung
Resolving https://issues.apache.org/jira/browse/HIVE-16391 means keeping Spark on Hive 1.2? I’m not sure that reduces the dependency on Hive - Hive is still there and it’s a very old Hive. IMO the risk increases the longer we stay on it. (And it’s been years) Looking at the two PRs.

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Ryan Blue
How do we know that most Spark users are not using Hive? I wouldn't be surprised either way, but I do want to make sure we aren't making decisions based on any one person's (or one company's) experience about what "most" Spark users do. On Tue, Jan 15, 2019 at 9:44 AM Xiao Li wrote: > Hi,

Re: [DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Xiao Li
Hi, Yuming, Thank you for your contributions! The community aims to reduce the dependence on Hive. Currently, most Spark users are not using Hive. The changes look risky to me. To support Hadoop 3.x, we just need to resolve this JIRA: https://issues.apache.org/jira/browse/HIVE-16391

SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries,

[DISCUSS] Upgrade built-in Hive to 2.3.4

2019-01-15 Thread Yuming Wang
Dear Spark Developers and Users, Hyukjin and I plan to upgrade the built-in Hive from 1.2.1-spark2 to 2.3.4 to solve some critical issues, such as supporting Hadoop 3.x,

[ANNOUNCE] Apache Roadshow Chicago, Call for Presentations

2019-01-15 Thread Trevor Grant
Hello Devs! You're receiving this email because you are subscribed to one or more Apache developer email lists. I’m writing to let you know about an exciting event coming to the Chicago area: The Apache Roadshow Chicago. It will be held May 13th and 14th at three bars in the Logan Square