Re: ML Pipelines in R

2018-05-22 Thread Hossein
Correction: the SPIP is https://issues.apache.org/jira/browse/SPARK-24359 --Hossein On Tue, May 22, 2018 at 6:23 PM, Hossein wrote: > Hi all, > > SparkR supports calling MLlib functionality with an R-friendly API. Since > Spark 1.5 the (new) SparkML API which is based on

ML Pipelines in R

2018-05-22 Thread Hossein
Hi all, SparkR supports calling MLlib functionality with an R-friendly API. Since Spark 1.5, the (new) SparkML API, which is based on pipelines and parameters, has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is

Re: [VOTE] Spark 2.3.1 (RC2)

2018-05-22 Thread Marcelo Vanzin
Starting with my own +1. Did the same testing as RC1. On Tue, May 22, 2018 at 12:45 PM, Marcelo Vanzin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.3.1. > > The vote is open until Friday, May 25, at 20:00 UTC and passes if > at least

[VOTE] Spark 2.3.1 (RC2)

2018-05-22 Thread Marcelo Vanzin
Please vote on releasing the following candidate as Apache Spark version 2.3.1. The vote is open until Friday, May 25, at 20:00 UTC and passes if at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.3.1 [ ] -1 Do not release this package because ... To learn more

Repeated FileSourceScanExec.metrics from ColumnarBatchScan.metrics

2018-05-22 Thread Jacek Laskowski
Hi, I'm wondering why the metrics are repeated in FileSourceScanExec.metrics [1], since it is a ColumnarBatchScan [2] and so inherits the two metrics numOutputRows and scanTime from ColumnarBatchScan.metrics [3]. Shouldn't FileSourceScanExec.metrics be as follows then: override lazy val

Re: Running lint-java during PR builds?

2018-05-22 Thread Hyukjin Kwon
I opened a PR - https://github.com/apache/spark/pull/21399 to run it with SBT. 2018-05-22 2:18 GMT+08:00 Reynold Xin: > Can we look into if there is a plugin for sbt that works and then we can > put everything into one single builder? > > On Mon, May 21, 2018 at 11:17 AM

Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Saikat Kanjilal
I’m in the same exact boat as Maximiliano and have use cases as well for model serving and would love to join this discussion. Sent from my iPhone On May 22, 2018, at 6:39 AM, Maximiliano Felice wrote: Hi! I don't usually

Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Maximiliano Felice
Hi! I don't usually write a lot on this list but I keep up to date with the discussions and I'm a heavy user of Spark. This topic caught my attention, as we're currently facing this issue at work. I'm attending the summit and was wondering if it would be possible for me to join that

Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Leif Walsh
I’m with you on JSON being more readable than Parquet, but we’ve had success using pyarrow’s Parquet reader and have been quite happy with it so far. If your target is Python (and probably if not now, then soon, R), you should look into it. On Mon, May 21, 2018 at 16:52 Joseph Bradley

Re: Sort-merge join improvement

2018-05-22 Thread Petar Zecevic
Hi, we went through a round of reviews on this PR. Performance improvements can be substantial and there are unit and performance tests included. One remark was that the amount of changed code is large, but I don't see how to reduce it and still keep the performance improvements. Besides,