You can look at ways to group records from both rdds together instead of doing Cartesian. Say generate pair rdd from each with first letter as key. Then do a partition and a join. On May 25, 2016 8:04 PM, "Priya Ch" <learnings.chitt...@gmail.com> wrote:
> Hi, > RDD A is of size 30MB and RDD B is of size 8 MB. Upon matching, we would > like to filter out the strings that have greater than 85% match and > generate a score for it which is used in the susbsequent calculations. > > I tried generating pair rdd from both rdds A and B with same key for all > the records. Now performing A.join(B) is also resulting in huge execution > time.. > > How do I go about with map and reduce here ? To generate pairs from 2 rdds > I dont think map can be used because we cannot have rdd inside another rdd. > > Would be glad if you can throw me some light on this. > > Thanks, > Padma Ch > > On Wed, May 25, 2016 at 7:39 PM, Jörn Franke <jornfra...@gmail.com> wrote: > >> Solr or Elastic search provide much more functionality and are faster in >> this context. The decision for or against them depends on your current and >> future use cases. Your current use case is still very abstract so in order >> to get a more proper recommendation you need to provide more details >> including size of dataset, what you do with the result of the matching do >> you just need the match number or also the pairs in the results etc. >> >> Your concrete problem can also be solved in Spark (though it is not the >> best and most efficient tool for this, but it has other strength) using the >> map reduce steps. There are different ways to implement this (Generate >> pairs from the input datasets in the map step or (maybe less recommendable) >> broadcast the smaller dataset to all nodes and do the matching with the >> bigger dataset there. >> This highly depends on the data in your data set. How they compare in >> size etc. >> >> >> >> On 25 May 2016, at 13:27, Priya Ch <learnings.chitt...@gmail.com> wrote: >> >> Why do i need to deploy solr for text anaytics...i have files placed in >> HDFS. just need to look for matches against each string in both files and >> generate those records whose match is > 85%. We trying to Fuzzy match >> logic. >> >> How can use map/reduce operations across 2 rdds ? >> >> Thanks, >> Padma Ch >> >> On Wed, May 25, 2016 at 4:49 PM, Jörn Franke <jornfra...@gmail.com> >> wrote: >> >>> >>> Alternatively depending on the exact use case you may employ solr on >>> Hadoop for text analytics >>> >>> > On 25 May 2016, at 12:57, Priya Ch <learnings.chitt...@gmail.com> >>> wrote: >>> > >>> > Lets say i have rdd A of strings as {"hi","bye","ch"} and another RDD >>> B of >>> > strings as {"padma","hihi","chch","priya"}. For every string rdd A i >>> need >>> > to check the matches found in rdd B as such for string "hi" i have to >>> check >>> > the matches against all strings in RDD B which means I need generate >>> every >>> > possible combination r >>> >> >> >