There is related discussion in https://issues.apache.org/jira/browse/SPARK-8836. It's not too hard to implement this without modifying Spark, and we measured a ~10x improvement over plain RDD joins. I haven't benchmarked against DataFrames -- maybe they also realize this performance advantage.
On Mon, Mar 21, 2016 at 11:41 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Do you have performance numbers to back up this proposal for the cogroup
> operation?
>
> Thanks
>
> On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ <
> joaquin.guantergonzal...@telefonica.com> wrote:
>
>> Hello devs,
>>
>> I have found myself in a situation where Spark is doing sub-optimal
>> computations on my RDDs, and I was wondering whether a patch enabling
>> better performance for this scenario would be a welcome addition to
>> Spark.
>>
>> The scenario arises when cogrouping two RDDs that are sorted by key and
>> share the same partitioner. CoGroupedRDD correctly detects that the RDDs
>> have the same partitioner and therefore creates narrow cogroup split
>> dependencies rather than shuffle dependencies. This is great because it
>> prevents any shuffling. However, the cogroup cannot detect that the RDDs
>> are sorted the same way, and it still inserts every element of each RDD
>> into a map in order to join the elements with the same key.
>>
>> When both RDDs are sorted in the same order, the cogroup can instead
>> join in a single pass over the data: since the data is ordered by key,
>> you just keep iterating until you reach a different key. This would
>> greatly reduce the memory requirements for this kind of operation.
>>
>> Adding this to Spark would require adding an "ordering" member of type
>> Option[Ordering] to RDD, similar to the existing "partitioner" field.
>> The sorting operations would populate this field, and the operations
>> that could benefit from this knowledge (cogroup, join, groupByKey,
>> etc.) could read it and change their behavior accordingly.
>>
>> Do you think this would be a good addition to Spark?
>>
>> Thanks,
>>
>> Ximo
>>
>> ------------------------------
>>
>> The information contained in this transmission is privileged and
>> confidential information intended only for the use of the individual or
>> entity named above. If the reader of this message is not the intended
>> recipient, you are hereby notified that any dissemination, distribution
>> or copying of this communication is strictly prohibited. If you have
>> received this transmission in error, do not read it. Please immediately
>> reply to the sender that you have received this communication in error
>> and then delete it.
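The single-pass merge described in the thread can be sketched in plain Python, operating on two key-sorted sequences of (key, value) pairs. This is a hypothetical illustration of the algorithm a sort-aware CoGroupedRDD could run per partition, not actual Spark code; `merge_cogroup` is an invented helper name:

```python
from itertools import groupby

def merge_cogroup(left, right):
    """Single-pass cogroup of two key-sorted iterables of (key, value) pairs.

    Illustrative sketch only: because both inputs are sorted by key, equal
    keys are adjacent, so a forward merge suffices -- no hash map holding an
    entire partition is needed.
    """
    # Group consecutive equal keys on each side; sortedness guarantees each
    # key appears in exactly one consecutive group.
    lit = groupby(left, key=lambda kv: kv[0])
    rit = groupby(right, key=lambda kv: kv[0])
    lcur = next(lit, None)
    rcur = next(rit, None)
    while lcur is not None or rcur is not None:
        if rcur is None or (lcur is not None and lcur[0] < rcur[0]):
            # Key present only on the left.
            k, vals = lcur
            yield k, ([v for _, v in vals], [])
            lcur = next(lit, None)
        elif lcur is None or rcur[0] < lcur[0]:
            # Key present only on the right.
            k, vals = rcur
            yield k, ([], [v for _, v in vals])
            rcur = next(rit, None)
        else:
            # Key present on both sides: emit both value groups together.
            k = lcur[0]
            yield k, ([v for _, v in lcur[1]], [v for _, v in rcur[1]])
            lcur = next(lit, None)
            rcur = next(rit, None)
```

Memory stays bounded by the values of the current key rather than by the whole partition, which is the reduction in memory requirements the proposal is after.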