How to join RDD keyValuePairs efficiently

Wang, Ningjun (LNG-NPV) Tue, 14 Apr 2015 07:43:47 -0700

I have an RDD that contains millions of Document objects. Each document has an 
unique Id that is a string. I need to find the documents by ids quickly. 
Currently I used RDD join as follow


First I save the RDD as object file

allDocs : RDD[Document] = getDocs()  // this RDD contains 7 million Document 
objects
allDocs.saveAsObjectFile("/temp/allDocs.obj")

Then I wrote a function to find documents by Ids

def findDocumentsByIds(docids: RDD[String]) = {
// docids contains less than 100 item

val allDocs : RDD[Document] =sc.objectFile[Document]( ("/temp/allDocs.obj")

val idAndDocs = allDocs.keyBy(d => dv.id)
    docids.map(id => (id,id)).join(idAndDocs).map(t => t._2._2)
}

I found that this is very slow. I suspect it scan the entire 7 million Document 
objects in "/temp/allDocs.obj" sequentially to find the desired document.

Is there any efficient way to do this?

One option I am thinking is that instead of storing the RDD[Document] as object 
file, I store each document in a separate file with filename equal to the 
docid. This way I can find a document quickly by docid. However this means I 
need to save the RDD to 7 million small file which will take a very long time 
to save and may cause IO problems with so many small files.

Is there any other way?



Ningjun

How to join RDD keyValuePairs efficiently

Reply via email to