[Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread pklemenkov
This Scala code:

scala> val logs = sc.textFile("big_data_specialization/log.txt").
     |   filter(x => !x.contains("INFO")).
     |   map(x => (x.split("\t")(1), 1)).
     |   reduceByKey((x, y) => x + y)

generated the obvious lineage:

(2) ShuffledRDD[4] at reduceByKey at :27 []
 +-(2)
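For readers without a Spark shell at hand, the logic of the pipeline in this message (drop lines containing "INFO", key each remaining line by its second tab-separated field, sum the counts per key) can be sketched in plain Python. This is only an illustration of what the transformations compute; it does not use Spark and says nothing about the DAG either API builds. The function name `count_non_info` and the sample log lines are hypothetical, and the log format is an assumption.

```python
from collections import Counter


def count_non_info(lines):
    """Count lines per second tab-separated field, skipping lines that contain "INFO"."""
    counts = Counter()
    for line in lines:
        # Mirrors filter(x => !x.contains("INFO"))
        if "INFO" in line:
            continue
        # Mirrors map(x => (x.split("\t")(1), 1)) followed by reduceByKey(_ + _)
        key = line.split("\t")[1]
        counts[key] += 1
    return dict(counts)


# Hypothetical sample data; the real log.txt layout is not shown in the thread.
sample = [
    "2017-05-10\tERROR\tdisk full",
    "2017-05-10\tINFO\tstarted",
    "2017-05-10\tWARN\tslow response",
    "2017-05-10\tERROR\ttimeout",
]
result = count_non_info(sample)
print(result)
```

On a real cluster the per-key sums are merged across partitions, which is the step that introduces the shuffle seen in the lineage.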

Benchmark of XGBoost, Vowpal Wabbit and Spark ML on Criteo 1TB Dataset

2017-05-03 Thread pklemenkov
Hi! We've done a cool benchmark of popular ML libraries (including Spark ML) on the Criteo 1TB dataset: https://github.com/rambler-digital-solutions/criteo-1tb-benchmark Spark ML was tested on a real production cluster and showed great results at scale. We'd like to see some feedback and tips for