RE: Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
[mailto:daniel.dara...@lynxanalytics.com] Sent: Monday, 21 March 2016 16:20 To: Ted Yu <yuzhih...@gmail.com> CC: JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com>; dev@spark.apache.org Subject: Re: Performance improvements for sorted RDDs There is related discussi

Re: Performance improvements for sorted RDDs

2016-03-21 Thread Daniel Darabos
There is related discussion in https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to implement this without modifying Spark and we measured ~10x improvement over plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also realize this performance advantage. On

Re: Performance improvements for sorted RDDs

2016-03-21 Thread Ted Yu
Do you have performance numbers to back up this proposal for the cogroup operation? Thanks On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <joaquin.guantergonzal...@telefonica.com> wrote: > Hello devs, > > I have found myself in a situation where Spark is doing sub-optimal >

Performance improvements for sorted RDDs

2016-03-21 Thread JOAQUIN GUANTER GONZALBEZ
Hello devs, I have found myself in a situation where Spark is doing sub-optimal computations for my RDDs, and I was wondering whether a patch to enable improved performance for this scenario would be a welcome addition to Spark or not. The scenario happens when trying to cogroup two RDDs that
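
For readers following the thread, here is a minimal sketch of why sorted inputs can make a cogroup cheaper. It is not taken from any message in the thread and is not Spark code: it assumes both inputs are already sorted by key, and the function name `merge_cogroup` is hypothetical. It only illustrates the general merge-style technique, which walks both inputs once instead of hashing or shuffling them.

```python
from itertools import groupby
from operator import itemgetter


def merge_cogroup(left, right):
    """Cogroup two iterables of (key, value) pairs, each sorted by key.

    Hypothetical helper for illustration only (not a Spark API).
    Yields (key, left_values, right_values) in ascending key order,
    using one streaming pass over each input.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))

    def advance(groups):
        # Pull the next key group and materialize its values immediately,
        # because groupby invalidates a group once the iterator moves on.
        entry = next(groups, None)
        if entry is None:
            return None
        key, pairs = entry
        return key, [value for _, value in pairs]

    l = advance(left_groups)
    r = advance(right_groups)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Key present only on the left side.
            yield l[0], l[1], []
            l = advance(left_groups)
        elif l is None or r[0] < l[0]:
            # Key present only on the right side.
            yield r[0], [], r[1]
            r = advance(right_groups)
        else:
            # Same key on both sides: emit both value lists together.
            yield l[0], l[1], r[1]
            l = advance(left_groups)
            r = advance(right_groups)


if __name__ == "__main__":
    left = [(1, "a"), (1, "b"), (3, "c")]
    right = [(1, "x"), (2, "y")]
    for row in merge_cogroup(left, right):
        print(row)
```

In a distributed setting the same merge would have to run on pairs of partitions that hold the same key ranges; how that is arranged is outside what the excerpts above describe.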