Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Felix Cheung
0/+1

Tested a bunch of R package/install cases.
Unfortunately we are still working on SPARK-18817, which looks to be a behavior change
introduced between Spark 1.6 and 2.0; in that case it won't be a blocker for 2.1.0.


_
From: vaquar khan
Sent: Sunday, December 18, 2016 2:33 PM
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)
To: Adam Roberts
Cc: Denny Lee, Holden Karau, Liwei Lin


+1 (non-binding)

Regards,
vaquar khan

On Sun, Dec 18, 2016 at 2:33 PM, Adam Roberts wrote:
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest SDK 
for Java (8 SR3 FP21).

Tests run clean on Ubuntu 16.04, 14.04, SUSE 12, CentOS 7.2 on x86 and
IBM-specific platforms including big-endian. On slower machines I see these failing
but nothing to be concerned over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails with 
informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_date, complete mode
org.apache.spark.sql.hive.HiveSparkSubmitSuite.set hive.metastore.warehouse.dir

Performance vs 2.0.2: lots of improvements seen using the HiBench and
SparkSqlPerf benchmarks, tested on a 48-core Intel machine using the Kryo
serializer in a controlled test environment. These are all open source benchmarks
anyone can use and experiment with. Elapsed times were measured; + scores are an
improvement (so it's that much percent faster) and - scores are
regressions I'm seeing.

  *   K-means: Java API +22% (100 sec to 78 sec), Scala API +30% (34 seconds to
24 seconds), Python API unchanged
  *   PageRank: minor improvement from 40 seconds to 38 seconds, +5%
  *   Sort: minor improvement, 10.8 seconds to 9.8 seconds, +10%
  *   WordCount: unchanged
  *   Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is -47%,
other times marginally faster by 15%, something to keep an eye on
  *   Terasort: +18% (39 seconds to 32 seconds) with the Java/Scala APIs

For TPC-DS SQL queries the results are a mixed bag again: I see > 10% boosts
for q9, q68, q75, q96 and > 10% slowdowns for q7, q39a, q43, q52, q57, q89.
Five iterations, average times compared, only changing which version of Spark
we're using.



From: Holden Karau
To: Denny Lee, Liwei Lin, "dev@spark.apache.org"
Date: 18/12/2016 20:05
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)




+1 (non-binding) - checked Python artifacts with virtual env.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee wrote:
+1 (non-binding)


On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin wrote:
+1

Cheers,
Liwei



On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang wrote:
I hope https://github.com/apache/spark/pull/16252 can be merged before the 2.1.0
release. It fixes an issue where a broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley wrote:
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier wrote:
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li wrote:
+1

Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung:

For R we have a license field in the DESCRIPTION, and this is standard practice
(and requirement) for R packages.

https://cran.r-project.org/doc/manuals/R-exts.html#Licensing

From: Sean Owen
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin; dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

(If you have a template for these emails, maybe update it to use https links.
They work for apache.org domains. After all we are asking people to
verify the integrity of release artifacts, so it might as well be secure.)

(Also the new archives use .tar.gz instead of .tgz like the others. No big
deal, my OCD eye just noticed it.)

Re: Aggregating over sorted data

2016-12-18 Thread Liang-Chi Hsieh

Hi,

As far as I know, Spark SQL doesn't provide native support for this feature at
the moment. After searching, I found that only a few database systems support it,
e.g., PostgreSQL.

Actually, based on Spark SQL's aggregate system, I think it is not very
difficult to add support for this feature. The question is how frequently
this feature is needed by Spark SQL users and whether it is worth adding,
because as far as I can see, this feature is not very common.

An alternative way to achieve this in current Spark SQL is to use an
Aggregator with the Dataset API. You can write a custom Aggregator that uses
a user-defined JVM object as a buffer to hold the input data for your
aggregate function. But you may need to write the necessary encoder for the
buffer object.
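
As a rough, untested sketch of that approach (the (key, value) input shape, the
final "sum of consecutive deltas", and the Kryo encoder for the buffer are all
arbitrary choices made only for illustration):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Buffers every value of a group in a plain JVM object (a List), then
// aggregates over the sorted values in finish().
object SortedDeltaSum extends Aggregator[(String, Double), List[Double], Double] {
  def zero: List[Double] = Nil
  def reduce(buf: List[Double], in: (String, Double)): List[Double] = in._2 :: buf
  def merge(b1: List[Double], b2: List[Double]): List[Double] = b1 ++ b2
  def finish(buf: List[Double]): Double = {
    val sorted = buf.sorted
    sorted.zip(sorted.drop(1)).map { case (a, b) => b - a }.sum
  }
  // The buffer object needs an encoder; Kryo is one simple choice here.
  def bufferEncoder: Encoder[List[Double]] = Encoders.kryo[List[Double]]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Hypothetical usage, with ds: Dataset[(String, Double)] of (key, value) rows:
//   ds.groupByKey(_._1).agg(SortedDeltaSum.toColumn.name("sortedDeltaSum")).show()

Note that buffering a whole group in memory like this only works when each group
fits in the memory of a single task.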

If you really need this feature, you may open a JIRA to ask for others'
opinions on it.






-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread vaquar khan
+1 (non-binding)

Regards,
vaquar khan

On Sun, Dec 18, 2016 at 2:33 PM, Adam Roberts  wrote:

> +1 (non-binding)
>
> *Functional*: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's
> latest SDK for Java (8 SR3 FP21).
>
> Tests run clean on Ubuntu 16.04, 14.04, SUSE 12, CentOS 7.2 on x86 and
> IBM-specific platforms including big-endian. On slower machines I see these
> failing but nothing to be concerned over (timeouts):
>
> *org.apache.spark.DistributedSuite.caching on disk*
> *org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails
> with informative message*
> *org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by
> current_time, complete mode*
> *org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by
> current_date, complete mode*
> *org.apache.spark.sql.hive.HiveSparkSubmitSuite.set
> hive.metastore.warehouse.dir*
>
> *Performance vs 2.0.2:* lots of improvements seen using the HiBench and
> SparkSqlPerf benchmarks, tested with a 48 core Intel machine using the Kryo
> serializer, controlled test environment. These are all open source
> benchmarks anyone can use and experiment with. Elapsed times measured, *+
> scores* are an improvement (so it's that much percent faster) and *-
> scores* are used for regressions I'm seeing.
>
>- K-means: Java API *+22%* (100 sec to 78 sec), Scala API *+30%* (34
>seconds to 24 seconds), Python API unchanged
>- PageRank: minor improvement from 40 seconds to 38 seconds, *+5%*
>- Sort: minor improvement, 10.8 seconds to 9.8 seconds, *+10%*
>- WordCount: unchanged
>- Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is
>*-47%*, other times marginally faster by *15%*, something to keep an
>eye on
>- Terasort: *+18%* (39 seconds to 32 seconds) with the Java/Scala APIs
>
>
> For TPC-DS SQL queries the results are a mixed bag again, I see > 10%
> boosts for q9,  q68, q75, q96 and > 10% slowdowns for q7, q39a, q43, q52,
> q57, q89. Five iterations, average times compared, only changing which
> version of Spark we're using
>
>
>
> From:Holden Karau 
> To:Denny Lee , Liwei Lin ,
> "dev@spark.apache.org" 
> Date:18/12/2016 20:05
> Subject:Re: [VOTE] Apache Spark 2.1.0 (RC5)
> --
>
>
>
> +1 (non-binding) - checked Python artifacts with virtual env.
>
> On Sun, Dec 18, 2016 at 11:42 AM Denny Lee <*denny.g@gmail.com*
> > wrote:
> +1 (non-binding)
>
>
> On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin <*lwl...@gmail.com*
> > wrote:
> +1
>
> Cheers,
> Liwei
>
>
>
> On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang <*wgy...@gmail.com*
> > wrote:
> I hope *https://github.com/apache/spark/pull/16252* can be merged before the
> 2.1.0 release. It fixes an issue where a broadcast cannot fit in memory.
>
> On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley <*jos...@databricks.com*
> > wrote:
> +1
>
> On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
> *hvanhov...@databricks.com* > wrote:
> +1
>
> On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li <*gatorsm...@gmail.com*
> > wrote:
> +1
>
> Xiao Li
>
> 2016-12-16 12:19 GMT-08:00 Felix Cheung <*felixcheun...@hotmail.com*
> >:
>
>
>
>
>
>
>
>
>
>
>
>
> For R we have a license field in the DESCRIPTION, and this is standard
> practice (and requirement) for R packages.
>
>
>
>
>
>
>
> *https://cran.r-project.org/doc/manuals/R-exts.html#Licensing*
> 
>
>
>
>
>
>
>
> --
>
>
> *From:* Sean Owen <*so...@cloudera.com* >
>
>
> * Sent:* Friday, December 16, 2016 9:57:15 AM
>
>
> * To:* Reynold Xin; *dev@spark.apache.org* 
>
>
> * Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)
>
>
>
>
>
>
>
>
>
>
> (If you have a template for these emails, maybe update it to use https
> links. They work for
>
> *apache.org*  domains. After all we are asking people
> to verify the integrity of release artifacts, so it might as well be
> secure.)
>
>
>
>
>
>
>
> (Also the new archives use .tar.gz instead of .tgz like the others. No big
> deal, my OCD eye just noticed it.)
>
>
>
>
>
>
>
> I don't see an Apache license / notice for the Pyspark or SparkR
> artifacts. It would be good practice to include this in a convenience
> binary. I'm not sure if it's strictly mandatory, but something to adjust in
> any event. I think that's all there is to
>
> do for SparkR. For Pyspark, which packages a bunch of dependencies, it
> does include the licenses (good) but I think it should include the NOTICE
> file.
>
>
>
>
>
>
>
> This is the first time I recall getting 0 test failures off the bat!

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Adam Roberts
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest 
SDK for Java (8 SR3 FP21).

Tests run clean on Ubuntu 16.04, 14.04, SUSE 12, CentOS 7.2 on x86 and
IBM-specific platforms including big-endian. On slower machines I see these
failing but nothing to be concerned over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails 
with informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_date, complete mode
org.apache.spark.sql.hive.HiveSparkSubmitSuite.set 
hive.metastore.warehouse.dir

Performance vs 2.0.2: lots of improvements seen using the HiBench and
SparkSqlPerf benchmarks, tested on a 48-core Intel machine using the
Kryo serializer in a controlled test environment. These are all open source
benchmarks anyone can use and experiment with. Elapsed times were measured; +
scores are an improvement (so it's that much percent faster) and - scores
are regressions I'm seeing.

K-means: Java API +22% (100 sec to 78 sec), Scala API +30% (34 seconds to 
24 seconds), Python API unchanged
PageRank: minor improvement from 40 seconds to 38 seconds, +5%
Sort: minor improvement, 10.8 seconds to 9.8 seconds, +10%
WordCount: unchanged
Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is -47%, 
other times marginally faster by 15%, something to keep an eye on
Terasort: +18% (39 seconds to 32 seconds) with the Java/Scala APIs

For TPC-DS SQL queries the results are a mixed bag again: I see > 10%
boosts for q9, q68, q75, q96 and > 10% slowdowns for q7, q39a, q43, q52,
q57, q89. Five iterations, average times compared, only changing which
version of Spark we're using.



From:   Holden Karau 
To: Denny Lee , Liwei Lin , 
"dev@spark.apache.org" 
Date:   18/12/2016 20:05
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)



+1 (non-binding) - checked Python artifacts with virtual env.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee  wrote:
+1 (non-binding)


On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin  wrote:
+1

Cheers,
Liwei



On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang  wrote:
I hope https://github.com/apache/spark/pull/16252 can be merged before the
2.1.0 release. It fixes an issue where a broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley  
wrote:
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li  wrote:
+1

Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung:

For R we have a license field in the DESCRIPTION, and this is standard
practice (and requirement) for R packages.

https://cran.r-project.org/doc/manuals/R-exts.html#Licensing

From: Sean Owen
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin; dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

(If you have a template for these emails, maybe update it to use https
links. They work for apache.org domains. After all we are asking people to verify
the integrity of release artifacts, so it might as well be secure.)

(Also the new archives use .tar.gz instead of .tgz like the others. No big
deal, my OCD eye just noticed it.)

I don't see an Apache license / notice for the Pyspark or SparkR
artifacts. It would be good practice to include this in a convenience
binary. I'm not sure if it's strictly mandatory, but something to adjust
in any event. I think that's all there is to do for SparkR. For Pyspark,
which packages a bunch of dependencies, it does include the licenses (good)
but I think it should include the NOTICE file.

This is the first time I recall getting 0 test failures off the bat!
I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.

I think I'd +1 this therefore unless someone knows that the license issue
above is real and a blocker.

On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin wrote:

Please vote on releasing the following candidate as Apache Spark version
2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.1.0-rc5
(cd0a08361e2526519e7c131c42116bf56fa62c76)

List of JIRA tickets resolved are:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Holden Karau
+1 (non-binding) - checked Python artifacts with virtual env.
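
For anyone who wants to run a quick workload against the RC locally (as the vote
email below suggests), a rough sketch of a local-mode sanity check could look like
this (the app name, data, and expected group count are arbitrary):

import org.apache.spark.sql.SparkSession

// Minimal local-mode smoke test: start a session on the RC artifacts,
// run a small DataFrame aggregation, and check the result shape.
object Rc5SmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-2.1.0-rc5-smoke-test")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    val df = (1 to 1000).map(i => (i % 10, i.toLong)).toDF("key", "value")
    val counts = df.groupBy("key").count().collect()

    assert(counts.length == 10, s"expected 10 groups, got ${counts.length}")
    println(s"Spark ${spark.version}: smoke test OK")
    spark.stop()
  }
}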

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee  wrote:

> +1 (non-binding)
>
>
> On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin  wrote:
>
> +1
>
> Cheers,
> Liwei
>
>
>
> On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang  wrote:
>
> I hope https://github.com/apache/spark/pull/16252 can be merged before the
> 2.1.0 release. It fixes an issue where a broadcast cannot fit in memory.
>
> On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley 
> wrote:
>
> +1
>
> On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
> +1
>
> On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li  wrote:
>
> +1
>
> Xiao Li
>
> 2016-12-16 12:19 GMT-08:00 Felix Cheung :
>
>
>
>
>
>
>
>
>
>
>
>
>
> For R we have a license field in the DESCRIPTION, and this is standard
> practice (and requirement) for R packages.
>
>
>
>
>
>
>
> https://cran.r-project.org/doc/manuals/R-exts.html#Licensing
>
>
>
>
>
>
>
> --
>
>
> *From:* Sean Owen 
>
>
> *Sent:* Friday, December 16, 2016 9:57:15 AM
>
>
> *To:* Reynold Xin; dev@spark.apache.org
>
>
> *Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)
>
>
>
>
>
>
>
>
>
>
> (If you have a template for these emails, maybe update it to use https
> links. They work for
>
> apache.org domains. After all we are asking people to verify the
> integrity of release artifacts, so it might as well be secure.)
>
>
>
>
>
>
>
> (Also the new archives use .tar.gz instead of .tgz like the others. No big
> deal, my OCD eye just noticed it.)
>
>
>
>
>
>
>
> I don't see an Apache license / notice for the Pyspark or SparkR
> artifacts. It would be good practice to include this in a convenience
> binary. I'm not sure if it's strictly mandatory, but something to adjust in
> any event. I think that's all there is to
>
> do for SparkR. For Pyspark, which packages a bunch of dependencies, it
> does include the licenses (good) but I think it should include the NOTICE
> file.
>
>
>
>
>
>
>
> This is the first time I recall getting 0 test failures off the bat!
>
>
> I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.
>
>
>
>
>
>
>
> I think I'd +1 this therefore unless someone knows that the license issue
> above is real and a blocker.
>
>
>
>
>
>
>
> On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin  wrote:
>
>
>
>
>
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
>
>
>
>
>
>
> [ ] +1 Release this package as Apache Spark 2.1.0
>
>
> [ ] -1 Do not release this package because ...
>
>
>
>
>
>
>
>
>
>
>
>
> To learn more about Apache Spark, please see
>
> http://spark.apache.org/
>
>
>
>
>
>
>
> The tag to be voted on is v2.1.0-rc5
> (cd0a08361e2526519e7c131c42116bf56fa62c76)
>
>
>
>
>
>
>
> List of JIRA tickets resolved are:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
>
>
>
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
>
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>
>
>
>
>
>
>
> Release artifacts are signed with the following key:
>
>
> https://people.apache.org/keys/committer/pwendell.asc
>
>
>
>
>
>
>
> The staging repository for this release can be found at:
>
>
> https://repository.apache.org/content/repositories/orgapachespark-1223/
>
>
>
>
>
>
>
> The documentation corresponding to this release can be found at:
>
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>
>
>
>
>
>
>
>
>
>
>
>
> *FAQ*
>
>
>
>
>
>
>
> *How can I help test this release?*
>
>
>
>
>
>
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
>
>
>
>
>
> *What should happen to JIRA tickets still targeting 2.1.0?*
>
>
>
>
>
>
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
>
>
>
>
>
>
> *What happened to RC3/RC4?*
>
>
>
>
>
>
>
> They had issues with the release packaging and as a result were skipped.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
> [image: http://databricks.com] 
>
>
>
>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>
>
>
>
>
>
>
>
>
>
>


Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Denny Lee
+1 (non-binding)


On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin  wrote:

> +1
>
> Cheers,
> Liwei
>
> On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang  wrote:
>
> I hope https://github.com/apache/spark/pull/16252 can be merged before the
> 2.1.0 release. It fixes an issue where a broadcast cannot fit in memory.
>
> On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley 
> wrote:
>
> +1
>
> On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
> +1
>
> On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li  wrote:
>
> +1
>
> Xiao Li
>
> 2016-12-16 12:19 GMT-08:00 Felix Cheung :
>
> For R we have a license field in the DESCRIPTION, and this is standard
> practice (and requirement) for R packages.
>
> https://cran.r-project.org/doc/manuals/R-exts.html#Licensing
>
> --
> *From:* Sean Owen 
> *Sent:* Friday, December 16, 2016 9:57:15 AM
> *To:* Reynold Xin; dev@spark.apache.org
> *Subject:* Re: [VOTE] Apache Spark 2.1.0 (RC5)
>
> (If you have a template for these emails, maybe update it to use https
> links. They work for apache.org domains. After all we are asking people
> to verify the integrity of release artifacts, so it might as well be
> secure.)
>
> (Also the new archives use .tar.gz instead of .tgz like the others. No big
> deal, my OCD eye just noticed it.)
>
> I don't see an Apache license / notice for the Pyspark or SparkR
> artifacts. It would be good practice to include this in a convenience
> binary. I'm not sure if it's strictly mandatory, but something to adjust in
> any event. I think that's all there is to do for SparkR. For Pyspark, which
> packages a bunch of dependencies, it does include the licenses (good) but I
> think it should include the NOTICE file.
>
> This is the first time I recall getting 0 test failures off the bat!
> I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.
>
> I think I'd +1 this therefore unless someone knows that the license issue
> above is real and a blocker.
>
> On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc5
> (cd0a08361e2526519e7c131c42116bf56fa62c76)
>
> List of JIRA tickets resolved are:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.0
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1223/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc5-docs/
>
>
> *FAQ*
>
> *How can I help test this release?*
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> *What should happen to JIRA tickets still targeting 2.1.0?*
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
> *What happened to RC3/RC4?*
>
> They had issues with the release packaging and as a result were skipped.
>
>
>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
> [image: http://databricks.com] 
>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>
>
>
>


Re: Expand the Spark SQL programming guide?

2016-12-18 Thread Anton Okolnychyi
Any comments/suggestions are more than welcome.
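
In case it helps, here is a rough, untested sketch of the kind of window-function
example discussed in the thread below (the department/salary data and column names
are made up purely for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, rank}

val spark = SparkSession.builder()
  .appName("window-example")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: one row per employee.
val employees = Seq(
  ("sales", "alice", 4200L),
  ("sales", "bob",   3900L),
  ("eng",   "carol", 5200L),
  ("eng",   "dave",  4700L)
).toDF("dept", "name", "salary")

// Rank employees by salary within each department and attach the
// per-department average salary to every row.
val byDeptSalary = Window.partitionBy("dept").orderBy(col("salary").desc)

employees
  .withColumn("rank_in_dept", rank().over(byDeptSalary))
  .withColumn("dept_avg", avg("salary").over(Window.partitionBy("dept")))
  .show()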

Thanks,
Anton

2016-12-18 15:08 GMT+01:00 Anton Okolnychyi :

> Here is the pull request: https://github.com/apache/spark/pull/16329
>
>
>
> 2016-12-16 20:54 GMT+01:00 Jim Hughes :
>
>> I'd be happy to review a PR.  At the minute, I'm still learning Spark
>> SQL, so writing documentation might be a bit of a stretch, but reviewing
>> would be fine.
>>
>> Thanks!
>>
>>
>> On 12/16/2016 08:39 AM, Thakrar, Jayesh wrote:
>>
>> Yes - that sounds good Anton, I can work on documenting the window
>> functions.
>>
>>
>>
>> *From: *Anton Okolnychyi
>> *Date: *Thursday, December 15, 2016 at 4:34 PM
>> *To: *Conversant
>> *Cc: *Michael Armbrust, Jim Hughes, "dev@spark.apache.org"
>> *Subject: *Re: Expand the Spark SQL programming guide?
>>
>>
>>
>> I think it will make sense to show a sample implementation of
>> UserDefinedAggregateFunction for DataFrames, and an example of the
>> Aggregator API for typed Datasets.
>>
>>
>>
>> Jim, what if I submit a PR and you join the review process? I also don't
>> mind splitting this if you want, but it seems to be overkill for this
>> part.
>>
>>
>>
>> Jayesh, shall I skip the window functions part since you are going to
>> work on that?
>>
>>
>>
>> 2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh 
>> :
>>
>> I too am interested in expanding the documentation for Spark SQL.
>>
>> For my work I needed to get some info/examples/guidance on window
>> functions and have been using
>> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html .
>>
>> How about divide and conquer?
>>
>>
>>
>>
>>
>> *From: *Michael Armbrust 
>> *Date: *Thursday, December 15, 2016 at 3:21 PM
>> *To: *Jim Hughes < jn...@ccri.com>
>> *Cc: *"dev@spark.apache.org" 
>> *Subject: *Re: Expand the Spark SQL programming guide?
>>
>>
>>
>> Pull requests would be welcome for any major missing features in the
>> guide:
>> 
>> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>>
>>
>>
>> On Thu, Dec 15, 2016 at 11:48 AM, Jim Hughes  wrote:
>>
>> Hi Anton,
>>
>> I'd like to see this as well.  I've been working on implementing
>> geospatial user-defined types and functions.  Having examples of
>> aggregations and window functions would be awesome!
>>
>> I did test out implementing a distributed convex hull as a
>> UserDefinedAggregateFunction, and that seemed to work sensibly.
>>
>> Cheers,
>>
>> Jim
>>
>>
>>
>> On 12/15/2016 03:28 AM, Anton Okolnychyi wrote:
>>
>> Hi,
>>
>>
>>
>> I am wondering whether it makes sense to expand the Spark SQL programming
>> guide with examples of aggregations (including user-defined via the
>> Aggregator API) and window functions.  For instance, there might be a
>> separate subsection under "Getting Started" for each functionality.
>>
>>
>>
>> SPARK-16046 seems to be related but there is no activity for more than 4
>> months.
>>
>>
>>
>> Best regards,
>>
>> Anton
>>
>>
>>
>>
>>
>>
>>
>>
>>
>


Re: Expand the Spark SQL programming guide?

2016-12-18 Thread Anton Okolnychyi
Here is the pull request: https://github.com/apache/spark/pull/16329
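
Independently of what the PR ends up containing, a rough sketch of the kind of
UserDefinedAggregateFunction example discussed below (a hypothetical geometric-mean
aggregate, with made-up column names) might look like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical guide example: a geometric mean for untyped DataFrames.
class GeometricMean extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("count", LongType) :: StructField("logSum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0.0
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + 1
      buffer(1) = buffer.getDouble(1) + math.log(input.getDouble(0))
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
  }
  def evaluate(buffer: Row): Double =
    if (buffer.getLong(0) == 0) Double.NaN
    else math.exp(buffer.getDouble(1) / buffer.getLong(0))
}

// Hypothetical usage (needs import org.apache.spark.sql.functions.col):
//   df.groupBy("group").agg(new GeometricMean()(col("value")).as("geo_mean"))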



2016-12-16 20:54 GMT+01:00 Jim Hughes :

> I'd be happy to review a PR.  At the minute, I'm still learning Spark SQL,
> so writing documentation might be a bit of a stretch, but reviewing would
> be fine.
>
> Thanks!
>
>
> On 12/16/2016 08:39 AM, Thakrar, Jayesh wrote:
>
> Yes - that sounds good Anton, I can work on documenting the window
> functions.
>
>
>
> *From: *Anton Okolnychyi
> *Date: *Thursday, December 15, 2016 at 4:34 PM
> *To: *Conversant
> *Cc: *Michael Armbrust, Jim Hughes, "dev@spark.apache.org"
> *Subject: *Re: Expand the Spark SQL programming guide?
>
>
>
> I think it will make sense to show a sample implementation of
> UserDefinedAggregateFunction for DataFrames, and an example of the
> Aggregator API for typed Datasets.
>
>
>
> Jim, what if I submit a PR and you join the review process? I also don't
> mind splitting this if you want, but it seems to be overkill for this
> part.
>
>
>
> Jayesh, shall I skip the window functions part since you are going to work
> on that?
>
>
>
> 2016-12-15 22:48 GMT+01:00 Thakrar, Jayesh :
>
> I too am interested in expanding the documentation for Spark SQL.
>
> For my work I needed to get some info/examples/guidance on window
> functions and have been using
> https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html .
>
> How about divide and conquer?
>
>
>
>
>
> *From: *Michael Armbrust 
> *Date: *Thursday, December 15, 2016 at 3:21 PM
> *To: *Jim Hughes < jn...@ccri.com>
> *Cc: *"dev@spark.apache.org" 
> *Subject: *Re: Expand the Spark SQL programming guide?
>
>
>
> Pull requests would be welcome for any major missing features in the
> guide:
> 
> https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
>
>
>
> On Thu, Dec 15, 2016 at 11:48 AM, Jim Hughes  wrote:
>
> Hi Anton,
>
> I'd like to see this as well.  I've been working on implementing
> geospatial user-defined types and functions.  Having examples of
> aggregations and window functions would be awesome!
>
> I did test out implementing a distributed convex hull as a
> UserDefinedAggregateFunction, and that seemed to work sensibly.
>
> Cheers,
>
> Jim
>
>
>
> On 12/15/2016 03:28 AM, Anton Okolnychyi wrote:
>
> Hi,
>
>
>
> I am wondering whether it makes sense to expand the Spark SQL programming
> guide with examples of aggregations (including user-defined via the
> Aggregator API) and window functions.  For instance, there might be a
> separate subsection under "Getting Started" for each functionality.
>
>
>
> SPARK-16046 seems to be related but there is no activity for more than 4
> months.
>
>
>
> Best regards,
>
> Anton
>
>
>
>
>
>
>
>
>