Do you have performance numbers to backup this proposal for cogroup operation ?
Thanks On Mon, Mar 21, 2016 at 1:06 AM, JOAQUIN GUANTER GONZALBEZ < joaquin.guantergonzal...@telefonica.com> wrote: > Hello devs, > > > > I have found myself in a situation where Spark is doing sub-optimal > computations for my RDDs, and I was wondering whether a patch to enable > improved performance for this scenario would be a welcome addition to Spark > or not. > > > > The scenario happens when trying to cogroup two RDDs that are sorted by > key and share the same partitioner. CoGroupedRDD will correctly detect that > the RDDs have the same partitioner and will therefore create narrow cogroup > split dependencies, as opposed to shuffle dependencies. This is great > because it prevents any shuffling from happening. However, the cogroup is > unable to detect that the RDDs are sorted in the same way, and will still > insert all elements of the RDD in a map in order to join the elements with > the same key. > > > > When both RDDs are sorted using the same order, the cogroup can just join > by doing a single pass over the data (since the data is ordered by key, you > can just keep iterating until you find a different key). This would greatly > reduce the memory requirements for these kind of operations. > > > > Adding this to spark would require adding an “ordering” member to RDD of > type Option[Ordering], similarly to how the “partitioner” field works. That > way, the sorting operations could populate this field and the operations > that could benefit from this knowledge (cogroup, join, groupbykey, etc.) > could read it to change their behavior accordingly. > > > > Do you think this would be a good addition to Spark? > > > > Thanks, > > Ximo > > ------------------------------ > > Este mensaje y sus adjuntos se dirigen exclusivamente a su destinatario, > puede contener información privilegiada o confidencial y es para uso > exclusivo de la persona o entidad de destino. Si no es usted. el > destinatario indicado, queda notificado de que la lectura, utilización, > divulgación y/o copia sin autorización puede estar prohibida en virtud de > la legislación vigente. Si ha recibido este mensaje por error, le rogamos > que nos lo comunique inmediatamente por esta misma vía y proceda a su > destrucción. > > The information contained in this transmission is privileged and > confidential information intended only for the use of the individual or > entity named above. If the reader of this message is not the intended > recipient, you are hereby notified that any dissemination, distribution or > copying of this communication is strictly prohibited. If you have received > this transmission in error, do not read it. Please immediately reply to the > sender that you have received this communication in error and then delete > it. > > Esta mensagem e seus anexos se dirigem exclusivamente ao seu destinatário, > pode conter informação privilegiada ou confidencial e é para uso exclusivo > da pessoa ou entidade de destino. Se não é vossa senhoria o destinatário > indicado, fica notificado de que a leitura, utilização, divulgação e/ou > cópia sem autorização pode estar proibida em virtude da legislação vigente. > Se recebeu esta mensagem por erro, rogamos-lhe que nos o comunique > imediatamente por esta mesma via e proceda a sua destruição >