There is related discussion in https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to implement this without modifying Spark, and we measured a ~10x improvement over plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also realize this performance advantage.
On Mon, Mar 21, 2016 at 11:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Do you have performance numbers to back up this proposal for the cogroup
> operation?
>
> Thanks
>
> On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <
> joaquin.guantergonzal...@telefonica.com> wrote:
>
>> Hello devs,
>>
>> I have found myself in a situation where Spark is doing sub-optimal
>> computations on my RDDs, and I was wondering whether a patch enabling
>> better performance for this scenario would be a welcome addition to
>> Spark.
>>
>> The scenario arises when cogrouping two RDDs that are sorted by key and
>> share the same partitioner. CoGroupedRDD correctly detects that the RDDs
>> have the same partitioner and therefore creates narrow cogroup split
>> dependencies rather than shuffle dependencies. This is great because it
>> prevents any shuffling. However, the cogroup cannot detect that the RDDs
>> are sorted the same way, and it still inserts every element of each RDD
>> into a map in order to join the elements with the same key.
>>
>> When both RDDs are sorted in the same order, the cogroup can instead
>> join in a single pass over the data: since the data is ordered by key,
>> you just keep iterating until you reach a different key. This would
>> greatly reduce the memory requirements for this kind of operation.
>>
>> Adding this to Spark would require adding an "ordering" member of type
>> Option[Ordering] to RDD, similar to the existing "partitioner" field.
>> The sorting operations would populate this field, and the operations
>> that could benefit from this knowledge (cogroup, join, groupByKey,
>> etc.) could read it and change their behavior accordingly.
>>
>> Do you think this would be a good addition to Spark?
>>
>> Thanks,
>>
>> Ximo
>>
>> ------------------------------
>>
>> The information contained in this transmission is privileged and
>> confidential information intended only for the use of the individual or
>> entity named above. If the reader of this message is not the intended
>> recipient, you are hereby notified that any dissemination, distribution
>> or copying of this communication is strictly prohibited. If you have
>> received this transmission in error, do not read it. Please immediately
>> reply to the sender that you have received this communication in error
>> and then delete it.
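The single-pass merge described in the thread can be sketched in plain Python, operating on two key-sorted sequences of (key, value) pairs. This is a hypothetical illustration of the algorithm a sort-aware CoGroupedRDD could run per partition, not actual Spark code; `merge_cogroup` is an invented helper name:

```python
from itertools import groupby

def merge_cogroup(left, right):
    """Single-pass cogroup of two key-sorted iterables of (key, value) pairs.

    Illustrative sketch only: because both inputs are sorted by key, equal
    keys are adjacent, so a forward merge suffices -- no hash map holding an
    entire partition is needed.
    """
    # Group consecutive equal keys on each side; sortedness guarantees each
    # key appears in exactly one consecutive group.
    lit = groupby(left, key=lambda kv: kv[0])
    rit = groupby(right, key=lambda kv: kv[0])
    lcur = next(lit, None)
    rcur = next(rit, None)
    while lcur is not None or rcur is not None:
        if rcur is None or (lcur is not None and lcur[0] < rcur[0]):
            # Key present only on the left.
            k, vals = lcur
            yield k, ([v for _, v in vals], [])
            lcur = next(lit, None)
        elif lcur is None or rcur[0] < lcur[0]:
            # Key present only on the right.
            k, vals = rcur
            yield k, ([], [v for _, v in vals])
            rcur = next(rit, None)
        else:
            # Key present on both sides: emit both value groups together.
            k = lcur[0]
            yield k, ([v for _, v in lcur[1]], [v for _, v in rcur[1]])
            lcur = next(lit, None)
            rcur = next(rit, None)
```

Memory stays bounded by the values of the current key rather than by the whole partition, which is the reduction in memory requirements the proposal is after.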