Decrease shuffle in TreeAggregate with coalesce ?

2016-04-27 Thread Guillaume Pitel
Hi, I've been looking at the code of RDD.treeAggregate, because we've seen a huge performance drop between 1.5.2 and 1.6.1 on a treeReduce. I think the treeAggregate code hasn't changed, so my message is not about the performance drop, but a more general remark about treeAggregate. In treeAg

Duplicated fit into TrainValidationSplit

2016-04-27 Thread Dirceu Semighini Filho
Hi guys, I was testing a pipeline here, and found a possibly duplicated call to the fit method in org.apache.spark.ml.tuning.TrainValidationSplit

Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Nick Pentreath
You should find that the first set of fits are called on the training set, and the resulting models evaluated on the validation set. The final best model is then retrained on the entire dataset. This is standard practice - usually the dataset passed to the train validation split is itself further s
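The procedure described in this reply can be sketched roughly as follows. This is a minimal plain-Python illustration, not Spark's TrainValidationSplit code: the toy estimator `fit_shrunken_mean`, the `mse` metric, the parameter grid, and the 0.75 split ratio are all assumptions made up for the example. It only shows why fit runs once per candidate on the training part and then once more on the full dataset.

```python
import random


def fit_shrunken_mean(rows, shrink):
    """Hypothetical toy estimator: the 'model' predicts mean(rows) * shrink."""
    mean = sum(rows) / len(rows)
    return mean * shrink


def mse(prediction, rows):
    """Mean squared error of a constant prediction against the labels."""
    return sum((y - prediction) ** 2 for y in rows) / len(rows)


def train_validation_split(rows, param_grid, train_ratio=0.75, seed=0):
    """Pick the best parameter on a held-out part, then refit on all rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, valid = shuffled[:cut], shuffled[cut:]

    fit_calls = 0
    scores = []
    for shrink in param_grid:
        # First set of fits: training part only, scored on the validation part.
        model = fit_shrunken_mean(train, shrink)
        fit_calls += 1
        scores.append(mse(model, valid))

    best_param = param_grid[scores.index(min(scores))]
    # Final fit: the best parameter is retrained on the entire dataset.
    best_model = fit_shrunken_mean(rows, best_param)
    fit_calls += 1
    return best_param, best_model, fit_calls


best_param, best_model, fit_calls = train_validation_split(
    [10.0] * 20, [0.5, 1.0, 1.5]
)
```

With three candidates the sketch calls the fit function four times, which is the kind of extra call that can look like a duplicate when reading logs.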

Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Dirceu Semighini Filho
Ok, thank you.
2016-04-27 11:37 GMT-03:00 Nick Pentreath:
> You should find that the first set of fits are called on the training set,
> and the resulting models evaluated on the validation set. The final best
> model is then retrained on the entire dataset. This is standard practice -
> usually

Re: [build system] short downtime wednesday morning (4-27-16), 7-9am

2016-04-27 Thread shane knapp
this will be postponed due to the 2.0 code freeze. sorry for the late notice.
On Mon, Apr 25, 2016 at 4:50 PM, shane knapp wrote:
> another project hosted on our jenkins (e-mission) needs anaconda scipy
> upgraded from 0.15.1 to 0.17.0. this will also upgrade a few other
> libs, which i've incl

Re: [build system] short downtime wednesday morning (4-27-16), 7-9am

2016-04-27 Thread shane knapp
we're going to go ahead and do this on monday. i'll send out another email later this week w/the details.
On Wed, Apr 27, 2016 at 8:50 AM, shane knapp wrote:
> this will be postponed due to the 2.0 code freeze. sorry for the late notice.
>
> On Mon, Apr 25, 2016 at 4:50 PM, shane knapp wrote:

Re: HDFS as Shuffle Service

2016-04-27 Thread Steve Loughran
> On 27 Apr 2016, at 04:59, Takeshi Yamamuro wrote:
>
> Hi, all
>
> See SPARK-1529 for related discussion.
>
> // maropu
I'd not seen that discussion. I'm actually curious about why the 15% diff in performance between Java NIO and Hadoop FS APIs, and, if it is the case (Hadoop still uses t

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-27 Thread Joseph Bradley
Do you have code which can reproduce this performance drop in treeReduce? It would be helpful to debug. In the 1.6 release, we profiled it via the various MLlib algorithms and did not see performance drops. It's not just renumbering the partitions; it is reducing the number of partitions by a fac
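The idea of shrinking the partition count by a factor at each level can be illustrated with a small plain-Python simulation. This is a sketch under stated assumptions, not the Spark source: the function name `tree_aggregate_sketch`, the list-of-lists stand-in for partitions, and the exact loop condition are choices made for the example, loosely following the general shape of a tree aggregation.

```python
import math
from functools import reduce


def tree_aggregate_sketch(partitions, zero, seq_op, comb_op, depth=2):
    """Simulate a tree aggregation over in-memory 'partitions' (lists).

    Returns (result, levels), where levels records the partition count
    after each intermediate combine step.
    """
    # Step 1: aggregate every partition locally.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    num = len(partials)
    if num == 0:
        return zero, []

    # Assumed scale factor: roughly the depth-th root of the partition count.
    scale = max(math.ceil(num ** (1.0 / depth)), 2)
    levels = []

    # Step 2: while another level still pays off, cut the partition count
    # by `scale` and merge partials whose index is equal modulo the new count.
    while num > scale + math.ceil(num / scale):
        num //= scale
        buckets = {}
        for i, partial in enumerate(partials):
            key = i % num
            buckets[key] = comb_op(buckets[key], partial) if key in buckets else partial
        partials = [buckets[k] for k in range(num)]
        levels.append(num)

    # Step 3: the remaining partials are combined in one final step.
    return reduce(comb_op, partials), levels


parts = [[i, i + 1] for i in range(16)]
total, levels = tree_aggregate_sketch(
    parts, 0, lambda acc, x: acc + x, lambda a, b: a + b
)
```

In this toy run, 16 partitions are merged down to 4 before the final combine, so each intermediate step moves data between partitions rather than only relabeling them.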

Error running spark-sql-perf version 0.3.2 against Spark 1.6

2016-04-27 Thread Michael Slavitch
Hello; I'm trying to run spark-sql-perf version 0.3.2 (hash cb0347b) against Spark 1.6. I get the following when running
./bin/run --benchmark DatsetPerformance
Exception in thread "main" java.lang.ClassNotFoundException: com.databricks.spark.sql.perf.DatsetPerformance
Even though the cl

Re: HDFS as Shuffle Service

2016-04-27 Thread Michael Gummelt
> Are you suggesting to have shuffle service persist and fetch data with hdfs, or skip shuffle service altogether and just write to hdfs?
Skip shuffle service altogether. Write to HDFS. Mesos environments tend to be multi-tenant, and running the shuffle service on all nodes could be extremely wa