Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Sean Owen
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I need to find the documents

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
Does anybody have a solution for this? From: Wang, Ningjun (LNG-NPV) Sent: Tuesday, April 14, 2015 10:41 AM To: user@spark.apache.org Subject: How to join RDD keyValuePairs efficiently I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
To: user@spark.apache.org Subject: Re: How to join RDD keyValuePairs efficiently This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set. On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG
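
For reference, a minimal Scala sketch of the filter approach Sean describes: collect the wanted ids into a plain Set, broadcast it, and filter the document RDD against it. The Document fields, the ids, and the existing SparkContext sc are illustrative assumptions, not details from the thread.

    import org.apache.spark.rdd.RDD

    // Illustrative Document type; the thread only says each document has a unique String id.
    case class Document(id: String, body: String)

    // sc is an existing SparkContext (e.g. the spark-shell's); the docs here are placeholders.
    val allDocs: RDD[Document] = sc.parallelize(Seq(
      Document("id-1", "..."), Document("id-2", "..."), Document("id-9", "...")))

    // Broadcast the (small) set of wanted ids and filter, instead of joining two RDDs.
    val wantedIds = Set("id-1", "id-2")
    val idsBc     = sc.broadcast(wantedIds)
    val matched: RDD[Document] = allDocs.filter(doc => idsBc.value.contains(doc.id))

This never shuffles the multi-million-document RDD; only the small set of ids moves over the network.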

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Akhil Das
You could try repartitioning your RDD using a custom partitioner (HashPartitioner, etc.) and caching the dataset into memory to speed up the joins. Thanks Best Regards On Tue, Apr 14, 2015 at 8:10 PM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I have an RDD that contains
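
A rough Scala sketch of what Akhil suggests, reusing the illustrative Document, allDocs, and sc from the sketch above; the partition count and ids are arbitrary placeholders.

    import org.apache.spark.HashPartitioner

    // Key the documents by id, hash-partition, and cache; repeated joins then reuse this layout.
    val docsById = allDocs.map(doc => (doc.id, doc))
      .partitionBy(new HashPartitioner(64))        // partition count is a guess
      .cache()

    // partitionBy shuffles the small id RDD once; the join is then co-partitioned,
    // so the large cached side is not re-shuffled on each query.
    val queryIds = sc.parallelize(Seq("id-1", "id-2"))
      .map(id => (id, ()))
      .partitionBy(docsById.partitioner.get)

    val found = docsById.join(queryIds).map { case (_, (doc, _)) => doc }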

Re: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Jeetendra Gangele
From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Thursday, April 16, 2015 9:39 PM To: user@spark.apache.org Subject: RE: How to join RDD keyValuePairs efficiently Evo: "partition the large doc RDD based on the hash function on the key, i.e. the docid". What API to use to do this?

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Subject: RE: How to join RDD keyValuePairs efficiently Evo: "partition the large doc RDD based on the hash function on the key, i.e. the docid". What API to use to do this? By the way, loading the entire dataset into memory causes an OutOfMemory problem because it is too large (I only have one machine
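
The thread's snippets do not show how the OutOfMemory issue was resolved, but a standard option when the data does not fit in one machine's RAM is to persist with a storage level that spills to disk instead of calling cache(); a sketch, again reusing allDocs from above:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK_SER keeps what fits in memory (serialized) and spills the rest to local disk,
    // so the 7-million-document pair RDD does not have to fit entirely in RAM.
    val docsById = allDocs.map(doc => (doc.id, doc))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)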

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Wang, Ningjun (LNG-NPV)
To: Wang, Ningjun (LNG-NPV) Cc: user@spark.apache.org Subject: Re: How to join RDD keyValuePairs efficiently This would be much, much faster if your set of IDs was simply a Set, and you passed that to a filter() call that just filtered in the docs that matched an ID in the set. On Thu, Apr 16

RE: How to join RDD keyValuePairs efficiently

2015-04-16 Thread Evo Eftimov
Yes, simply look for partitionBy in the javadoc, e.g. for JavaPairRDD. From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Thursday, April 16, 2015 9:57 PM To: Evo Eftimov Cc: Wang, Ningjun (LNG-NPV); user Subject: Re: How to join RDD keyValuePairs efficiently Does this same
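
In the Scala API the same method is partitionBy on a pair RDD (via PairRDDFunctions), as used in the sketch after Akhil Das's message above. A related note not made in the thread: once the pair RDD is hash-partitioned and cached, single ids can also be fetched with lookup(), which only evaluates the partition that owns the key when a partitioner is present.

    // docsById: the hash-partitioned, cached RDD[(String, Document)] from the earlier sketch.
    // With a partitioner set, lookup() runs a job on only the partition that owns the key.
    val hits: Seq[Document] = docsById.lookup("id-1")   // "id-1" is a placeholder id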

How to join RDD keyValuePairs efficiently

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
I have an RDD that contains millions of Document objects. Each document has a unique Id that is a string. I need to find the documents by ids quickly. Currently I use RDD join, as follows. First I save the RDD as an object file: allDocs: RDD[Document] = getDocs() // this RDD contains 7 million
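
For context, a reconstruction in Scala of the join-based lookup the question describes, using the illustrative Document type and sc from the first sketch; getDocs() is only stubbed here, and the path and ids are placeholders.

    import org.apache.spark.rdd.RDD

    // Stand-in for the poster's loader; the real getDocs() is not shown in the thread.
    def getDocs(): RDD[Document] =
      sc.parallelize(Seq(Document("id-1", "..."), Document("id-2", "...")))

    // Save the corpus once as an object file, as the question describes.
    val allDocs: RDD[Document] = getDocs()              // ~7 million documents in the real case
    allDocs.saveAsObjectFile("/tmp/allDocs")

    // Later: reload, key by id, and join against an RDD of wanted ids.
    val docsById = sc.objectFile[Document]("/tmp/allDocs").map(doc => (doc.id, doc))
    val wanted   = sc.parallelize(Seq("id-1", "id-2")).map(id => (id, ()))

    // This join shuffles the whole document RDD by key on every query,
    // which is the cost the replies above are trying to avoid.
    val found: RDD[Document] = docsById.join(wanted).map { case (_, (doc, _)) => doc }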