This Scala code:
scala> val logs = sc.textFile("big_data_specialization/log.txt").
| filter(x => !x.contains("INFO")).
| map(x => (x.split("\t")(1), 1)).
| reduceByKey((x, y) => x + y)
generated the following lineage:
(2) ShuffledRDD[4] at reduceByKey at <console>:27 []
+-(2)
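For anyone without a cluster handy, the same filter/map/reduceByKey pipeline can be sketched with plain Scala collections (no Spark needed), where `groupBy` plus a sum stands in for the shuffle that `reduceByKey` triggers. The sample lines and the function name here are made up for illustration; the real log.txt is only assumed to be tab-separated:

```scala
// Plain-Scala sketch of the Spark pipeline above, runnable without a cluster.
// filter -> map -> reduceByKey becomes filter -> map -> groupBy + sum.
def countNonInfoByField(lines: Seq[String]): Map[String, Int] =
  lines
    .filter(x => !x.contains("INFO"))    // same predicate as the RDD filter
    .map(x => (x.split("\t")(1), 1))     // key on the second tab-separated field
    .groupBy(_._1)                       // local stand-in for the shuffle stage
    .map { case (k, pairs) => (k, pairs.map(_._2).sum) }

// Hypothetical sample lines standing in for log.txt.
val sample = Seq(
  "2017-01-01\tERROR\tdisk full",
  "2017-01-01\tINFO\tstartup",
  "2017-01-02\tERROR\ttimeout"
)
println(countNonInfoByField(sample))  // Map(ERROR -> 2)
```

In Spark the `groupBy` step is where the stage boundary falls, which is why the lineage above shows a ShuffledRDD at `reduceByKey`.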
Hi!
We've run a benchmark of popular ML libraries (including Spark ML) on the
Criteo 1TB dataset:
https://github.com/rambler-digital-solutions/criteo-1tb-benchmark
Spark ML was tested on a real production cluster and showed strong results at
scale.
We'd like to see some feedback and tips for