This would be much, much faster if your set of IDs were simply a Set,
and you passed that to a filter() call that keeps only the docs
whose ID is in the set.
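
Something like this (untested sketch; it assumes the ids fit comfortably
in driver memory, as they do here, and that Document exposes an "id"
field as in your snippet):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  def findDocumentsByIds(sc: SparkContext,
                         allDocs: RDD[Document],
                         docids: Set[String]): RDD[Document] = {
    // ship the small set of ids to every executor once
    val idsB = sc.broadcast(docids)
    // keep only matching documents; no shuffle, unlike join()
    allDocs.filter(d => idsB.value.contains(d.id))
  }

Broadcasting the small Set avoids the shuffle that the join() version pays for.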

On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV)
<ningjun.w...@lexisnexis.com> wrote:
> Does anybody have a solution for this?
>
>
>
>
>
> From: Wang, Ningjun (LNG-NPV)
> Sent: Tuesday, April 14, 2015 10:41 AM
> To: user@spark.apache.org
> Subject: How to join RDD keyValuePairs efficiently
>
>
>
> I have an RDD that contains millions of Document objects. Each document has
> a unique Id that is a string. I need to find documents by their ids quickly.
> Currently I use an RDD join as follows
>
>
>
> First I save the RDD as object file
>
>
>
> val allDocs: RDD[Document] = getDocs()  // this RDD contains 7 million Document
> objects
>
> allDocs.saveAsObjectFile("/temp/allDocs.obj")
>
>
>
> Then I wrote a function to find documents by their ids
>
>
>
> def findDocumentsByIds(docids: RDD[String]) = {
>
>   // docids contains fewer than 100 items
>
>   val allDocs: RDD[Document] = sc.objectFile[Document]("/temp/allDocs.obj")
>
>   val idAndDocs = allDocs.keyBy(d => d.id)
>
>   docids.map(id => (id, id)).join(idAndDocs).map(t => t._2._2)
>
> }
>
>
>
> I found that this is very slow. I suspect it scans the entire 7 million
> Document objects in "/temp/allDocs.obj" sequentially to find the desired
> documents.
>
>
>
> Is there any efficient way to do this?
>
>
>
> One option I am considering is that, instead of storing the RDD[Document] as
> an object file, I store each document in a separate file whose filename equals
> the docid. This way I can find a document quickly by docid. However, this
> means I need to save the RDD as 7 million small files, which will take a very
> long time and may cause IO problems with so many small files.
>
>
>
> Is there any other way?
>
>
>
>
>
>
>
> Ningjun

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
