Re: Best way to Hive to Spark migration

2018-04-04 Thread Jörn Franke
You need to provide more context on what you currently do in Hive and what 
you expect from the migration.

> On 5. Apr 2018, at 05:43, Pralabh Kumar wrote:
> 
> Hi Spark group
> 
> What's the best way to migrate Hive to Spark?
> 
> 1) Use HiveContext of Spark
> 2) Use Hive on Spark 
> (https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
> 3) Migrate Hive to Calcite to Spark SQL
> 
> 
> Regards
> 


Best way to Hive to Spark migration

2018-04-04 Thread Pralabh Kumar
Hi Spark group

What's the best way to migrate Hive to Spark?

1) Use HiveContext of Spark (a minimal sketch follows below)
2) Use Hive on Spark
(https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started)
3) Migrate Hive to Calcite to Spark SQL
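
For option 1, a minimal sketch of what that route looks like on Spark 2.x, where
SparkSession with enableHiveSupport() has taken over the role of the older
HiveContext; the database/table name (sales.events) is a hypothetical placeholder:

import org.apache.spark.sql.SparkSession

// Spark 2.x entry point; enableHiveSupport() wires the session to the existing
// Hive metastore, replacing the older HiveContext.
val spark = SparkSession.builder()
  .appName("hive-migration-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Existing Hive tables become queryable through Spark SQL without moving data.
// sales.events is a hypothetical table name.
val df = spark.sql(
  "SELECT event_type, count(*) AS cnt FROM sales.events GROUP BY event_type")
df.show()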


Regards


time for Apache Spark 3.0?

2018-04-04 Thread Reynold Xin
There was a discussion thread on scala-contributors about Apache Spark not yet
supporting Scala 2.12, and that got me thinking that perhaps it is about time
for Spark to work towards the 3.0 release. By the time it comes out, it will
be more than 2 years since Spark 2.0.

For contributors less familiar with Spark’s history, I want to give more
context on Spark releases:

1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
in 2018.

2. Spark’s versioning policy promises that Spark does not break stable APIs
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to
3.0).

3. That said, a major version isn’t necessarily a playground for disruptive
API changes that make it painful for users to update. The main purpose of a
major release is to provide an opportunity to fix things that are broken in
the current API and to remove certain deprecated APIs.

4. Spark as a project has a culture of evolving architecture and developing
major new features incrementally, so major releases are not the only time
for exciting new features. For example, the bulk of the work in the move
towards the DataFrame API was done in Spark 1.3, and Continuous Processing
was introduced in Spark 2.3. Both were feature releases rather than major
releases.


You can find more background in the thread discussing Spark 2.0:
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html


The primary motivating factor IMO for a major version bump is to support
Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
Similar to Spark 2.0, I think there are also opportunities for other
changes that we know have been biting us for a long time but can’t be
changed in feature releases (to be clear, I’m actually not sure they are
all good ideas, but I’m writing them down as candidates for consideration):

1. Support Scala 2.12.

2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
2.x.

3. Shade all dependencies.

4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant,
to prevent users from shooting themselves in the foot, e.g. “SELECT 2
SECOND” -- is “SECOND” an interval unit or an alias? To make the upgrade less
painful for users here, I’d suggest creating a flag for a backward
compatibility mode. (A short illustration follows after this list.)

5. Similar to 4, make our type coercion rules in DataFrame/SQL more standards
compliant, and have a flag for backward compatibility.

6. Miscellaneous other small changes documented in JIRA already (e.g.
“JavaPairRDD flatMapValues requires function returning Iterable, not
Iterator”, “Prevent column name duplication in temporary view”).
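
To make point 4 (and the coercion issue in point 5) concrete, a short sketch of
how Spark 2.x behaves today; the session setup is boilerplate, and nothing here
implies what the proposed 3.0 behaviour or compatibility flag would look like:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql-compat-examples")
  .master("local[*]")
  .getOrCreate()

// Today SECOND is parsed as a column alias, so this returns a column named
// `SECOND` with value 2, not an interval of two seconds.
spark.sql("SELECT 2 SECOND").show()

// The explicit interval literal is unambiguous.
spark.sql("SELECT INTERVAL 2 SECOND").show()

// Point 5: the current type coercion silently promotes the string to a double,
// so this yields 2.0 rather than an error or a required explicit cast.
spark.sql("SELECT '1' + 1").show()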


Now the reality of a major version bump is that the world often thinks in
terms of what exciting features are coming. I do think there are a number
of major changes happening already that can be part of the 3.0 release, if
they make it in:

1. Scala 2.12 support (listing it twice)
2. Continuous Processing non-experimental
3. Kubernetes support non-experimental
4. A more fleshed-out version of the data source API v2 (I don’t think it is
realistic to stabilize it in one release)
5. Hadoop 3.0 support
6. ...



Similar to the 2.0 discussion, this thread should focus on the framework
and whether it’d make sense to create Spark 3.0 as the next release, rather
than the individual feature requests. Those are important but are best done
in their own separate threads.


Re: Hadoop 3 support

2018-04-04 Thread Felix Cheung
What would be the strategy with Hive? Cherry-pick patches? Update to a more 
“modern” version (like 2.3)?

I know of a few critical schema evolution fixes that we could port to Hive 
1.2.1-spark.


_
From: Steve Loughran 
Sent: Tuesday, April 3, 2018 1:33 PM
Subject: Re: Hadoop 3 support
To: Apache Spark Dev 




On 3 Apr 2018, at 01:30, Saisai Shao wrote:

Yes, the main blocking issue is that the Hive version used in Spark (1.2.1.spark) 
doesn't support running on Hadoop 3. Hive checks the Hadoop version at runtime 
[1]. Besides this, I think some pom changes should be enough to support 
Hadoop 3.

If we want to use the Hadoop 3 shaded client jars, then the pom requires lots of 
changes, but this is not necessary.


[1] 
https://github.com/apache/hive/blob/6751225a5cde4c40839df8b46e8d241fdda5cd34/shims/common/src/main/java/org/apache/hadoop/hive/shims/ShimLoader.java#L144
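
A simplified Scala sketch of the kind of runtime check involved; the real code is
the Java ShimLoader linked in [1], and the shim name below is illustrative rather
than a copy of it:

import org.apache.hadoop.util.VersionInfo

// Hive 1.2.x resolves its Hadoop shims from the Hadoop version found at runtime
// and rejects major versions it does not recognize, which is why Hadoop 3 fails.
def resolveShims(): String = {
  val version = VersionInfo.getVersion        // e.g. "2.7.3" or "3.1.0"
  version.split("\\.")(0) match {
    case "2" => "hadoop23-shims"              // Hadoop 2.x maps onto Hive's "0.23"-era shims
    case _ =>
      throw new IllegalArgumentException(
        s"Unrecognized Hadoop major version number: $version")
  }
}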

2018-04-03 4:57 GMT+08:00 Marcelo Vanzin:
Saisai filed SPARK-23534, but the main blocking issue is really SPARK-18673.


On Mon, Apr 2, 2018 at 1:00 PM, Reynold Xin wrote:
> Does anybody know what needs to be done in order for Spark to support Hadoop
> 3?
>


To be ruthless, I'd view Hadoop 3.1 as the first one to play with; 3.0.x was 
more of a wide-version check. Hadoop 3.1 RC0 is out this week, making it the 
ideal (last!) time to find showstoppers.

1. I've got a PR which adds a profile to build Spark against Hadoop 3, with 
some fixes for the ZK import along with a better hadoop-cloud profile:

https://github.com/apache/spark/pull/20923


Apply that patch and both mvn and sbt can build with the RC0 from the ASF 
staging repo:

build/sbt -Phadoop-3,hadoop-cloud,yarn -Psnapshots-and-staging



2. Everything Marcelo says about hive.

You can build Hadoop locally with -Dhadoop.version=2.11 and the Hive 
1.2.1-spark version check goes through. You can't safely bring up HDFS like 
that, but you can run Spark standalone against it.

Some strategies

Short term: build a new hive-1.2.x-spark which fixes up the version check and 
merges in those critical patches that Cloudera, Hortonworks, Databricks, and 
anyone else have got in for their production systems. I don't think we have that 
many.

That leaves a "how to release" story, as the ASF will want it to come out under 
the ASF auspices, and, given the liability disclaimers, so should everyone. The 
Hive team could be "invited" to publish it as their own if people ask nicely.

Long term:
 - do something about that subclassing to get the Thrift endpoint to work. That 
   can include fixing Hive's service to be subclass-friendly.
 - move to Hive 2

That's a major piece of work.