Do you have performance numbers to backup this proposal for cogroup
operation ?

Thanks

On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <
joaquin.guantergonzal...@telefonica.com> wrote:

> Hello devs,
>
>
>
> I have found myself in a situation where Spark is doing sub-optimal
> computations for my RDDs, and I was wondering whether a patch to enable
> improved performance for this scenario would be a welcome addition to Spark
> or not.
>
>
>
> The scenario happens when trying to cogroup two RDDs that are sorted by
> key and share the same partitioner. CoGroupedRDD will correctly detect that
> the RDDs have the same partitioner and will therefore create narrow cogroup
> split dependencies, as opposed to shuffle dependencies. This is great
> because it prevents any shuffling from happening. However, the cogroup is
> unable to detect that the RDDs are sorted in the same way, and will still
> insert all elements of the RDD in a map in order to join the elements with
> the same key.
>
>
>
> When both RDDs are sorted using the same order, the cogroup can just join
> by doing a single pass over the data (since the data is ordered by key, you
> can just keep iterating until you find a different key). This would greatly
> reduce the memory requirements for these kind of operations.
>
>
>
> Adding this to spark would require adding an “ordering” member to RDD of
> type Option[Ordering], similarly to how the “partitioner” field works. That
> way, the sorting operations could populate this field and the operations
> that could benefit from this knowledge (cogroup, join, groupbykey, etc.)
> could read it to change their behavior accordingly.
>
>
>
> Do you think this would be a good addition to Spark?
>
>
>
> Thanks,
>
> Ximo
>
> ------------------------------
>
> Este mensaje y sus adjuntos se dirigen exclusivamente a su destinatario,
> puede contener información privilegiada o confidencial y es para uso
> exclusivo de la persona o entidad de destino. Si no es usted. el
> destinatario indicado, queda notificado de que la lectura, utilización,
> divulgación y/o copia sin autorización puede estar prohibida en virtud de
> la legislación vigente. Si ha recibido este mensaje por error, le rogamos
> que nos lo comunique inmediatamente por esta misma vía y proceda a su
> destrucción.
>
> The information contained in this transmission is privileged and
> confidential information intended only for the use of the individual or
> entity named above. If the reader of this message is not the intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of this communication is strictly prohibited. If you have received
> this transmission in error, do not read it. Please immediately reply to the
> sender that you have received this communication in error and then delete
> it.
>
> Esta mensagem e seus anexos se dirigem exclusivamente ao seu destinatário,
> pode conter informação privilegiada ou confidencial e é para uso exclusivo
> da pessoa ou entidade de destino. Se não é vossa senhoria o destinatário
> indicado, fica notificado de que a leitura, utilização, divulgação e/ou
> cópia sem autorização pode estar proibida em virtude da legislação vigente.
> Se recebeu esta mensagem por erro, rogamos-lhe que nos o comunique
> imediatamente por esta mesma via e proceda a sua destruição
>

Reply via email to