Re: Cartesian join on RDDs taking too much time

Priya Ch Wed, 25 May 2016 07:35:29 -0700

Hi,
  RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would
like to filter out the strings that have greater than 85% match and
generate a score for it which is used in the susbsequent calculations.


I tried generating pair rdd from both rdds A and B with same key for all
the records. Now performing A.join(B) is also resulting in huge execution
time..

How do I go about with map and reduce here ? To generate pairs from 2 rdds
I dont think map can be used because we cannot have rdd inside another rdd.

Would be glad if you can throw me some light on this.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Solr or Elastic search provide much more functionality and are faster in
> this context. The decision for or against them depends on your current and
> future use cases. Your current use case is still very abstract so in order
> to get a more proper recommendation you need to provide more details
> including size of dataset, what you do with the result of the matching do
> you just need the match number or also the pairs in the results etc.
>
> Your concrete problem can also be solved in Spark (though it is not the
> best and most efficient tool for this, but it has other strength) using the
> map reduce steps. There are different ways to implement this (Generate
> pairs from the input datasets in the map step or (maybe less recommendable)
> broadcast the smaller dataset to all nodes and do the matching with the
> bigger dataset there.
> This highly depends on the data in your data set. How they compare in size
> etc.
>
>
>
> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> Why do i need to deploy solr for text anaytics...i have files placed in
> HDFS. just need to look for matches against each string in both files and
> generate those records whose match is > 85%. We trying to Fuzzy match
> logic.
>
> How can use map/reduce operations across 2 rdds ?
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>>
>> Alternatively depending on the exact use case you may employ solr on
>> Hadoop for text analytics
>>
>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>> >
>> > Lets say i have rdd A of strings as  {"hi","bye","ch"} and another RDD
>> B of
>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i
>> need
>> > to check the matches found in rdd B as such for string "hi" i have to
>> check
>> > the matches against all strings in RDD B which means I need generate
>> every
>> > possible combination r
>>
>
>

Re: Cartesian join on RDDs taking too much time

Reply via email to