Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
…deprecating it in 2.4. I'd also consider looking at what other data science tools are doing before fully removing it: for example, if Pandas and TensorFlow no longer support Python 2 past some point, that might be a good point to remove it. > Matei …

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
FWIW, Pandas is dropping Py2 support at the end of this year. TensorFlow is less clear: they only support Py3 on Windows, but there is no reference to any policy about Py2 on their roadmap or the …

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Erik Erlandson
…deprecate Py2 already in the 2.4.0 release. On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson wrote: > In case this didn't make it onto this thread: there is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it entirely on a later 3.x release …

Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
In case this didn't make it onto this thread: there is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it entirely on a later 3.x release. On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson wrote: > On a separate dev@spark thread, I raised a question of whether or not to support Python 2 …

Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
On a separate dev@spark thread, I raised a question of whether or not to support Python 2 in Apache Spark, going forward into Spark 3.0. Python 2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 is an opportunity to make breaking changes …

Announcing Spark on Kubernetes release 0.4.0

2017-09-25 Thread Erik Erlandson
The Spark on Kubernetes development community is pleased to announce release 0.4.0 of Apache Spark with a native Kubernetes scheduler back-end! The dev community is planning to use this release as the reference for upstreaming native Kubernetes capability over the Spark 2.3 release cycle. This …

Announcing isarn-sketches-spark v0.2.0 with pyspark support

2017-09-02 Thread Erik Erlandson
Release 0.2.0 of isarn-sketches-spark is available! - pyspark support - pyc files consumable from package jars - cross-publishing for Spark and Python versions - UDAFs for reducing t-digest columns and/or groups via groupBy https://github.com/isarn/isarn-sketches-spark
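For flavor, a minimal usage sketch; the import path and the tdigestUDAF helper below are assumptions recalled from the project README of that era, so verify them against the repo before relying on them:

```scala
import org.apache.spark.sql.SparkSession
// Assumed import; check the isarn-sketches-spark README for the
// actual package path and helper names.
import org.isarnproject.sketches.udaf._

val spark = SparkSession.builder.appName("tdigest-demo").getOrCreate()
import spark.implicits._

val df = spark.range(100000).map(_.toDouble).toDF("x")

// Reduce a numeric column into a t-digest sketch; the same UDAF
// composes with groupBy to produce one digest per group.
val sketched = df.agg(tdigestUDAF[Double]($"x"))
sketched.show()
```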

Apache Spark on Kubernetes: New Release for Spark 2.2

2017-08-14 Thread Erik Erlandson
The Apache Spark on Kubernetes Community Development Project is pleased to announce the latest release of Apache Spark with a native scheduler back-end for Kubernetes! Features provided in this release include: - Cluster-mode submission of Spark jobs to a Kubernetes cluster - Support …

UDAFs for sketching Dataset columns with T-Digests

2017-07-05 Thread Erik Erlandson
After my talk on T-Digests in Spark at Spark Summit East, there were some requests for a UDAF-based interface for working with Datasets. I'm pleased to announce that I've released a library for doing T-Digest sketching with UDAFs: https://github.com/isarn/isarn-sketches-spark This initial release …

Spark on Kubernetes: Birds-of-a-Feather Session 12:50pm 6/6 @ Spark Summit

2017-06-05 Thread Erik Erlandson
Come learn about the community development project to add a native Kubernetes scheduling back-end to Apache Spark! Meet contributors and network with community members interested in running Spark on Kubernetes. Learn how to run Spark jobs on your Kubernetes cluster; find out how to contribute to …

An Apache Spark metric sink for Kafka

2017-04-18 Thread Erik Erlandson
I wrote up a simple metric sink for Spark that publishes metrics to a Kafka broker. Each metric is published as a message (in JSON format), with the metric name as the message key. https://github.com/erikerlandson/spark-kafka-sink Build with "(x)sbt assembly" and make sure the resulting jar …
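A hedged sketch of how such a sink would be wired into Spark's metrics system; the sink class name and the "broker"/"topic" option keys below are assumptions patterned on Spark's built-in sinks (e.g. GraphiteSink), not quotes from the repo:

```scala
import org.apache.spark.SparkConf

// Configure a custom metrics sink via SparkConf properties; the
// spark.metrics.conf.* prefix mirrors what you would put in a
// metrics.properties file. Class name and option keys are assumed.
val conf = new SparkConf()
  .setAppName("kafka-metrics-demo")
  .set("spark.metrics.conf.*.sink.kafka.class",
    "org.apache.spark.metrics.sink.KafkaSink")
  .set("spark.metrics.conf.*.sink.kafka.broker", "localhost:9092")
  .set("spark.metrics.conf.*.sink.kafka.topic", "spark-metrics")
```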

Re: Efficient for loops in Spark

2016-05-16 Thread Erik Erlandson
Regarding the specific problem of generating random folds in a more efficient way, this should help: http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions It uses a sort of multiplexing formalism on RDDs: …
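The rough idea behind doing this in one pass, as a self-contained sketch: tag each row once with a random fold id, then carve out the k folds by filtering on the tag. This illustrates the tag-and-filter variant, not the silex implementation itself:

```scala
import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Assign each row a random fold id, deterministically per partition
// (seeded RNG), so folds are stable across recomputation.
def randomFolds[T: ClassTag](rdd: RDD[T], k: Int,
                             seed: Long = 42L): Seq[RDD[T]] = {
  val tagged = rdd.mapPartitionsWithIndex { (pid, it) =>
    val rng = new Random(seed + pid)
    it.map(x => (rng.nextInt(k), x))
  }.cache() // one tagging pass, shared by all k filters below
  (0 until k).map(f => tagged.filter(_._1 == f).map(_._2))
}
```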

Re: Spark serialization in closure

2015-07-09 Thread Erik Erlandson
I think you have stumbled across this idiosyncrasy: http://erikerlandson.github.io/blog/2015/03/31/hygienic-closures-for-scala-function-serialization/ > I am not sure whether this is more of a question for Spark or just Scala, but I am posting my question here. The code …
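The gist of that post, in a minimal example (illustrative, not code from the thread): referencing a class field inside an RDD closure silently captures `this`, forcing the whole enclosing object through serialization.

```scala
import org.apache.spark.rdd.RDD

class Scaler(val factor: Double) { // note: not Serializable
  // BAD: `factor` is really `this.factor`, so the closure captures
  // `this` and Spark throws NotSerializableException on Scaler.
  def scaleBad(rdd: RDD[Double]): RDD[Double] = rdd.map(_ * factor)

  // Hygienic version: copy the field into a local val so the closure
  // captures only a Double.
  def scaleGood(rdd: RDD[Double]): RDD[Double] = {
    val f = factor
    rdd.map(_ * f)
  }
}
```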

Re: Streaming K-medoids

2015-06-01 Thread Erik Erlandson
I haven't given any thought to streaming it, but in case it's useful I do have a k-medoids implementation for Spark: http://silex.freevariable.com/latest/api/#com.redhat.et.silex.cluster.KMedoids Also a blog post about multi-threading it: …
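For reference, the core k-medoids loop is short; a toy local (non-Spark, single-threaded) sketch, assuming a caller-supplied distance function, and not the silex implementation:

```scala
import scala.util.Random

// Plain PAM-style k-medoids on an in-memory Seq: assign points to
// their nearest medoid, then move each medoid to the cluster member
// minimizing total in-cluster distance. Empty clusters are dropped
// for brevity; a real implementation would re-seed them.
def kMedoids[T](data: Seq[T], k: Int, dist: (T, T) => Double,
                iters: Int = 10): Seq[T] = {
  def cost(m: T, cluster: Seq[T]): Double = cluster.map(dist(m, _)).sum
  var medoids = Random.shuffle(data).take(k)
  for (_ <- 1 to iters) {
    val clusters = data.groupBy(x => medoids.minBy(dist(x, _)))
    medoids = clusters.values.toSeq.map(c => c.minBy(cost(_, c)))
  }
  medoids
}
```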

Re: removing first record from RDD[String]

2014-12-23 Thread Erik Erlandson
There is also a lazy implementation: http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/ I generated a PR for it; there was also an alternate proposal for having it be a library on the new Spark Packages site: …
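On the original question itself (dropping the first record of an RDD[String], e.g. a CSV header), the common eager idiom is the partition-index trick; a minimal sketch:

```scala
import org.apache.spark.rdd.RDD

// Drop only the first record of the whole RDD: the first element of
// partition 0. Every other partition passes through untouched.
def dropFirst(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitionsWithIndex { (pid, it) =>
    if (pid == 0) it.drop(1) else it
  }
```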