Solr or Elasticsearch provide much more functionality and are faster in this 
context. The decision for or against them depends on your current and future 
use cases. Your current use case is still very abstract, so to give a more 
suitable recommendation you would need to provide more details, including the 
size of the dataset and what you do with the result of the matching (do you 
just need the number of matches, or also the matching pairs in the result, etc.).

Your concrete problem can also be solved in Spark (though it is not the best 
or most efficient tool for this, it has other strengths) using map/reduce 
steps. There are different ways to implement this: generate the pairs from 
the input datasets in the map step, or (maybe less recommendable) broadcast 
the smaller dataset to all nodes and do the matching against the bigger 
dataset there. Which way is better depends heavily on the data in your 
datasets, e.g. how the two compare in size.
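
For illustration, here is a minimal sketch of both variants against the toy 
RDDs from your earlier mail. It assumes a Levenshtein-based similarity in 
percent, since the exact fuzzy-match metric was not specified; the object name 
FuzzyMatchSketch and the interpretation of the 85% threshold are likewise just 
assumptions, not a definitive implementation.

import org.apache.spark.sql.SparkSession

object FuzzyMatchSketch {

  // Plain Levenshtein edit distance; the actual fuzzy-match metric used in
  // the thread is not specified, so this is only an assumption.
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
                         d(i - 1)(j - 1) + cost)
    }
    d(a.length)(b.length)
  }

  // Similarity in percent, relative to the longer string.
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 100.0
    else (1.0 - levenshtein(a, b).toDouble / maxLen) * 100.0
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FuzzyMatchSketch").getOrCreate()
    val sc = spark.sparkContext

    val rddA = sc.parallelize(Seq("hi", "bye", "ch"))                 // smaller dataset
    val rddB = sc.parallelize(Seq("padma", "hihi", "chch", "priya"))  // bigger dataset

    // Broadcast variant: ship the smaller dataset to every executor and
    // compare each element of the bigger dataset against it locally.
    val smallA = sc.broadcast(rddA.collect())
    val matches = rddB.flatMap { b =>
      smallA.value.flatMap { a =>
        val s = similarity(a, b)
        if (s > 85.0) Some((a, b, s)) else None
      }
    }
    matches.collect().foreach(println)

    // Pair-generation variant: build all pairs explicitly (a full cross
    // product, so it gets expensive as the datasets grow).
    // val pairs = rddA.cartesian(rddB)
    //   .map { case (a, b) => (a, b, similarity(a, b)) }
    //   .filter(_._3 > 85.0)

    spark.stop()
  }
}

Whether the broadcast variant or the explicit pair generation is preferable 
mostly comes down to whether the smaller dataset comfortably fits in memory 
on every executor.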



> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote:
> 
> Why do I need to deploy Solr for text analytics... I have files placed in HDFS. 
> I just need to look for matches against each string in both files and generate 
> those records whose match is > 85%. We are trying fuzzy match logic. 
> 
> How can I use map/reduce operations across 2 RDDs?
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>> 
>> Alternatively, depending on the exact use case, you may employ Solr on Hadoop 
>> for text analytics.
>> 
>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com> wrote:
>> >
>> > Let's say I have RDD A of strings as {"hi","bye","ch"} and another RDD B of
>> > strings as {"padma","hihi","chch","priya"}. For every string in RDD A I need
>> > to check the matches found in RDD B; for example, for string "hi" I have to
>> > check the matches against all strings in RDD B, which means I need to generate
>> > every possible combination r
> 
