0/+1

Tested a bunch of R package/install cases.
Unfortunately we are still working on SPARK-18817, which looks to be a behavior
change introduced between Spark 1.6 and 2.0. In that case it won't be a blocker.


_____________________________
From: vaquar khan <vaquar.k...@gmail.com>
Sent: Sunday, December 18, 2016 2:33 PM
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)
To: Adam Roberts <arobe...@uk.ibm.com>
Cc: Denny Lee <denny.g....@gmail.com>, Holden Karau <hol...@pigscanfly.ca>, Liwei Lin <lwl...@gmail.com>, <dev@spark.apache.org>


+1 (non-binding)

Regards,
vaquar khan

On Sun, Dec 18, 2016 at 2:33 PM, Adam Roberts <arobe...@uk.ibm.com> wrote:
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest SDK for Java (8 SR3 FP21).

Tests run clean on Ubuntu 16.04, Ubuntu 14.04, SUSE 12, and CentOS 7.2 on x86 and IBM-specific platforms including big-endian. On slower machines I see the following tests failing, but nothing to be concerned over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by current_date, complete mode
org.apache.spark.sql.hive.HiveSparkSubmitSuite.set hive.metastore.warehouse.dir

Performance vs 2.0.2: lots of improvements seen using the HiBench and SparkSqlPerf benchmarks, tested on a 48-core Intel machine using the Kryo serializer in a controlled test environment. These are all open-source benchmarks anyone can use and experiment with. Elapsed times were measured; + scores are improvements (the run is that much percent faster) and - scores are regressions I'm seeing.

  *   K-means: Java API +22% (100 sec to 78 sec), Scala API +30% (34 seconds to 24 seconds), Python API unchanged
  *   PageRank: minor improvement from 40 seconds to 38 seconds, +5%
  *   Sort: minor improvement, 10.8 seconds to 9.8 seconds, +10%
  *   WordCount: unchanged
  *   Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is -47%, other times marginally faster by 15%, something to keep an eye on
  *   Terasort: +18% (39 seconds to 32 seconds) with the Java/Scala APIs

For the TPC-DS SQL queries the results are a mixed bag again: I see > 10% boosts for q9, q68, q75, and q96, and > 10% slowdowns for q7, q39a, q43, q52, q57, and q89. Five iterations, average times compared, the only change being which version of Spark we're using.
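
For reference, the +/- percentage scores above come straight from the elapsed times. A minimal Python sketch of the calculation (the helper name and sample values are illustrative, not part of the benchmark harness):

def percent_change(old_secs, new_secs):
    """Percentage score relative to the old (Spark 2.0.2) elapsed time:
    positive means faster (an improvement), negative means a regression."""
    return (old_secs - new_secs) / old_secs * 100.0

# K-means (Java API): 100 sec on 2.0.2 down to 78 sec on 2.1.0 RC5
print(round(percent_change(100, 78)))   # 22  -> reported as "+22%"

# A slow Bayes run: 95 sec up to 140 sec
print(round(percent_change(95, 140)))   # -47 -> reported as "-47%"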



From:        Holden Karau <hol...@pigscanfly.ca>
To:        Denny Lee <denny.g....@gmail.com>, Liwei Lin <lwl...@gmail.com>, "dev@spark.apache.org" <dev@spark.apache.org>
Date:        18/12/2016 20:05
Subject:        Re: [VOTE] Apache Spark 2.1.0 (RC5)
________________________________



+1 (non-binding) - checked Python artifacts with virtual env.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee <denny.g....@gmail.com> wrote:
+1 (non-binding)


On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin <lwl...@gmail.com> wrote:
+1

Cheers,
Liwei



On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang <wgy...@gmail.com> wrote:
I hope https://github.com/apache/spark/pull/16252 can be fixed before the 2.1.0 release. It's a fix for a broadcast that cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley <jos...@databricks.com> wrote:
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li <gatorsm...@gmail.com> wrote:
+1

Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung <felixcheun...@hotmail.com>:

For R we have a license field in the DESCRIPTION, and this is standard practice (and a requirement) for R packages:

https://cran.r-project.org/doc/manuals/R-exts.html#Licensing

________________________________

From: Sean Owen <so...@cloudera.com>
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin; dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

(If you have a template for these emails, maybe update it to use https links. They work for apache.org domains. After all, we are asking people to verify the integrity of release artifacts, so it might as well be secure.)

(Also the new archives use .tar.gz instead of .tgz like the others. No big deal, my OCD eye just noticed it.)

I don't see an Apache license / notice for the Pyspark or SparkR artifacts. It would be good practice to include this in a convenience binary. I'm not sure if it's strictly mandatory, but something to adjust in any event. I think that's all there is to do for SparkR. For Pyspark, which packages a bunch of dependencies, it does include the licenses (good) but I think it should include the NOTICE file.

This is the first time I recall getting 0 test failures off the bat! I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.

I think I'd +1 this, therefore, unless someone knows that the license issue above is real and a blocker.

On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin <r...@databricks.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.0-rc5 (cd0a08361e2526519e7c131c42116bf56fa62c76)

The list of JIRA tickets resolved is:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

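As a side note for anyone checking the downloaded binaries against the published digests, here is a minimal sketch using only the Python standard library. The artifact filename is illustrative; substitute whichever file you downloaded and compare the printed value with the matching digest file published alongside the binaries. (Signature verification against the key above is a separate step, e.g. with GnuPG.)

import hashlib

# Illustrative filename; use whichever artifact you actually downloaded
# from the spark-2.1.0-rc5-bin directory above.
artifact = "spark-2.1.0-bin-hadoop2.7.tgz"

sha = hashlib.sha512()
with open(artifact, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

# Compare this against the published digest for the same file.
print(artifact, sha.hexdigest())
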
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1223/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/


FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.

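For example, a minimal PySpark smoke test along the following lines (an illustrative sketch only, not a substitute for a real workload) can be run against the RC binaries to surface obvious regressions quickly:

from pyspark.sql import SparkSession

# Tiny sanity workload against the release candidate (illustrative only).
spark = SparkSession.builder.master("local[2]").appName("rc5-smoke-test").getOrCreate()
print(spark.version)  # expect 2.1.0 for this RC

lines = spark.sparkContext.parallelize(["spark rc5", "spark vote", "rc5"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(sorted(counts))  # [('rc5', 2), ('spark', 2), ('vote', 1)]

spark.stop()
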
What should happen to JIRA tickets still targeting 2.1.0?

Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else, please retarget to 2.1.1 or 2.2.0.

What happened to RC3/RC4?

They had issues with the release packaging and as a result were skipped.

--
Herman van Hövell
Software Engineer
Databricks Inc.
hvanhov...@databricks.com
+31 6 420 590 27
databricks.com

--
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.
http://databricks.com/

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU



--
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago

