This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set.
On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote: > Does anybody have a solution for this? > > > > > > From: Wang, Ningjun (LNG-NPV) > Sent: Tuesday, April 14, 2015 10:41 AM > To: user@spark.apache.org > Subject: How to join RDD keyValuePairs efficiently > > > > I have an RDD that contains millions of Document objects. Each document has > an unique Id that is a string. I need to find the documents by ids quickly. > Currently I used RDD join as follow > > > > First I save the RDD as object file > > > > allDocs : RDD[Document] = getDocs() // this RDD contains 7 million Document > objects > > allDocs.saveAsObjectFile(“/temp/allDocs.obj”) > > > > Then I wrote a function to find documents by Ids > > > > def findDocumentsByIds(docids: RDD[String]) = { > > // docids contains less than 100 item > > val allDocs : RDD[Document] =sc.objectFile[Document]( (“/temp/allDocs.obj”) > > val idAndDocs = allDocs.keyBy(d => dv.id) > > docids.map(id => (id,id)).join(idAndDocs).map(t => t._2._2) > > } > > > > I found that this is very slow. I suspect it scan the entire 7 million > Document objects in “/temp/allDocs.obj” sequentially to find the desired > document. > > > > Is there any efficient way to do this? > > > > One option I am thinking is that instead of storing the RDD[Document] as > object file, I store each document in a separate file with filename equal to > the docid. This way I can find a document quickly by docid. However this > means I need to save the RDD to 7 million small file which will take a very > long time to save and may cause IO problems with so many small files. > > > > Is there any other way? > > > > > > > > Ningjun --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org