Hi,
I have an idea that I would like to discuss with the Spark devs. The
idea comes from a very real problem that I have struggled with for
almost a year. My problem is very simple: it's a dense matrix * sparse
matrix product. I have a dense matrix RDD[(Int, FloatMatrix)], divided
into X large blocks (one block per partition), and a sparse matrix
RDD[((Int, Int), Array[Array[(Int, Float)]])], divided into X * Y
blocks. The most efficient way I have found to perform the operation is
to collectAsMap() the dense matrix and broadcast it, then perform the
block-local multiplications, and finally combine the results by column.
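
For concreteness, here is a minimal sketch of that approach. It assumes
FloatMatrix is org.jblas.FloatMatrix and that each sparse block stores
one Array of (rowIndex, value) pairs per output column; my actual layout
may differ slightly:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits for reduceByKey
import org.apache.spark.rdd.RDD
import org.jblas.FloatMatrix

// One block-local product: dense block * sparse block.
// cols(c) is assumed to list the (denseColumnIndex, value) entries of
// output column c.
def multiplyBlock(block: FloatMatrix, cols: Array[Array[(Int, Float)]]): FloatMatrix = {
  val out = new FloatMatrix(block.rows, cols.length)
  for (c <- cols.indices; (r, v) <- cols(c); row <- 0 until block.rows)
    out.put(row, c, out.get(row, c) + v * block.get(row, r))
  out
}

def multiplyBroadcast(
    sc: SparkContext,
    dense: RDD[(Int, FloatMatrix)],
    sparse: RDD[((Int, Int), Array[Array[(Int, Float)]])]): RDD[(Int, FloatMatrix)] = {
  // ship the whole dense matrix to every worker
  val denseBC = sc.broadcast(dense.collectAsMap())
  sparse
    .map { case ((i, j), cols) => (j, multiplyBlock(denseBC.value(i), cols)) }
    .reduceByKey(_.addi(_)) // combine block-local partial products by column block
}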
This works quite well, unless the dense matrix is too big to fit in
memory (especially since the multiplication is performed several times
iteratively, and the broadcasts are not always cleaned from memory as
quickly as I would naively expect).
When the dense matrix is too big, a second solution is to split it into
several RDDs and do several smaller broadcasts, one after the other.
Doing this creates quite a bit of overhead, but it mostly works, even
though I often run into problems such as inaccessible broadcast files.
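
Roughly, the chunked variant looks like this. It is again just a sketch
that reuses the multiplyBlock helper above; the chunk assignment by
i % numChunks is an arbitrary choice for illustration:

def multiplyChunked(
    sc: SparkContext,
    dense: RDD[(Int, FloatMatrix)],
    sparse: RDD[((Int, Int), Array[Array[(Int, Float)]])],
    numChunks: Int): RDD[(Int, FloatMatrix)] = {
  val partials = (0 until numChunks).map { chunk =>
    // collect and broadcast only the dense blocks of this chunk
    val bc = sc.broadcast(
      dense.filter { case (i, _) => i % numChunks == chunk }.collectAsMap())
    val part = sparse
      .filter { case ((i, _), _) => i % numChunks == chunk }
      .map { case ((i, j), cols) => (j, multiplyBlock(bc.value(i), cols)) }
    part.cache().count() // force evaluation so the chunk can be released
    bc.unpersist()       // drop this chunk from the workers before the next one
    part
  }
  sc.union(partials).reduceByKey(_.addi(_))
}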
Then there is the terrible but apparently very effective good old join.
Since Y blocks of the sparse matrix use the same block of the dense
matrix, I suspect that the dense matrix is somehow replicated Y times
(either on disk or over the network), which is the reason why the join
takes so much time.
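
For reference, the join variant is essentially this (same helper and
format assumptions as above):

def multiplyJoin(
    dense: RDD[(Int, FloatMatrix)],
    sparse: RDD[((Int, Int), Array[Array[(Int, Float)]])]): RDD[(Int, FloatMatrix)] = {
  sparse
    .map { case ((i, j), cols) => (i, (j, cols)) } // re-key by dense block id
    .join(dense) // ships a copy of each dense block to every matching sparse block
    .map { case (_, ((j, cols), block)) => (j, multiplyBlock(block, cols)) }
    .reduceByKey(_.addi(_))
}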
After this bit of context, here is my idea: would it be possible to
somehow "broadcast" (or maybe more accurately, share or serve) a
persisted RDD that is distributed across all workers, in a way that
would, a bit like IndexedRDD, allow a task to access a partition or an
element of a partition from inside its closure, backed by a worker-local
memory cache? That is, the information about where each block resides
would be distributed to the workers, allowing them to access parts of
the RDD directly. I think that's already a bit how RDDs are shuffled?
The RDD could stay distributed (no need to collect then broadcast), and
only the necessary transfers would take place.
Is this a bad idea? Is it already implemented somewhere (I would love
that!)? Or is it something that could add efficiency not only for my use
case, but maybe for others? Could someone give me some hints about how I
could add this possibility to Spark? I would probably try to extend RDD
into a specific SharedIndexedRDD with a special lookup that would be
allowed from tasks as a special case, and that would contact the
BlockManager to fetch the corresponding data from the right worker.
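
To make the intent concrete, here is a purely hypothetical sketch of
the API I have in mind. SharedIndexedRDD and its lookup do not exist
anywhere today; the caching and fetching behaviour described in the
comments is what I would hope to implement:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical: an RDD whose partitions can be queried by key from
// inside other tasks, with a worker-local cache in front of remote
// fetches through the BlockManager.
abstract class SharedIndexedRDD[K: ClassTag, V: ClassTag](prev: RDD[(K, V)])
    extends RDD[(K, V)](prev) {
  // Callable from a task closure: locate the partition holding `key`
  // (block locations would be shipped to every worker), fetch the block
  // if it is remote, memoize it locally, then look the key up.
  def lookup(key: K): Option[V]
}

// Hypothetical usage: each sparse-side task pulls only the dense blocks
// it needs, and the dense matrix never has to be collected:
//   val sharedDense: SharedIndexedRDD[Int, FloatMatrix] = ... // wraps the persisted dense RDD
//   val result = sparse
//     .map { case ((i, j), cols) => (j, multiplyBlock(sharedDense.lookup(i).get, cols)) }
//     .reduceByKey(_.addi(_))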
Thanks in advance for your advice.
Guillaume
--
eXenSa
*Guillaume PITEL, Président*
+33(0)626 222 431
eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705