Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Sonal Goyal
You can look at ways to group records from both RDDs together instead of doing a Cartesian product. For example, generate a pair RDD from each with the first letter as the key, then partition and join.
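The first-letter blocking idea above can be sketched in plain Python (the same groupBy/join logic would map onto Spark pair RDDs; the record sets here are illustrative, taken from the example strings later in this thread):

```python
from collections import defaultdict

def block_by_first_letter(records):
    """Group strings into buckets keyed by their first letter."""
    buckets = defaultdict(list)
    for r in records:
        if r:
            buckets[r[0]].append(r)
    return buckets

def blocked_pairs(records_a, records_b):
    """Yield candidate pairs only within matching buckets,
    instead of the full |A| x |B| Cartesian product."""
    a = block_by_first_letter(records_a)
    b = block_by_first_letter(records_b)
    for key in a.keys() & b.keys():
        for x in a[key]:
            for y in b[key]:
                yield (x, y)

pairs = list(blocked_pairs(["hi", "bye", "ch"],
                           ["padma", "hihi", "chch", "priya"]))
# Only pairs sharing a first letter survive: ("hi", "hihi") and ("ch", "chch")
```

The blocking key is a heuristic: it prunes most comparisons, but a fuzzy match whose two strings start with different letters would be missed, so the key must suit the matching rule.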

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi, RDD A is 30 MB and RDD B is 8 MB. Upon matching, we would like to filter out the strings that have a greater than 85% match and generate a score for each, which is used in the subsequent calculations. I tried generating a pair RDD from both RDDs A and B with the same key for all the
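One simple way to compute such a score is the standard-library difflib ratio (an assumption; the thread never says which fuzzy metric is actually used):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity in [0.0, 1.0] based on longest matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()

def matches_above(pairs, threshold=0.85):
    """Keep only pairs whose similarity exceeds the threshold,
    attaching the score for use in later calculations."""
    return [(a, b, s) for a, b in pairs
            if (s := similarity(a, b)) > threshold]

scored = matches_above([("priya", "priya"), ("priya", "padma")])
# Identical strings score 1.0 and pass; ("priya", "padma") falls below 0.85
```

In a Spark job this predicate would run inside the map/filter over candidate pairs, so keeping it cheap matters more than anything else.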

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Solr or Elasticsearch provide much more functionality and are faster in this context. The decision for or against them depends on your current and future use cases. Your current use case is still very abstract, so in order to get a proper recommendation you need to provide more details

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Why do I need to deploy Solr for text analytics? I have files placed in HDFS and just need to look for matches against each string in both files and generate those records whose match is > 85%. We are trying fuzzy match logic. How can I use map/reduce operations across two RDDs? Thanks, Padma Ch

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
Alternatively, depending on the exact use case, you may employ Solr on Hadoop for text analytics.

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
No, this is not needed; look at the map/reduce operations and the standard Spark word count example.

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Let's say I have an RDD A of strings {"hi","bye","ch"} and another RDD B of strings {"padma","hihi","chch","priya"}. For every string in RDD A I need to check the matches found in RDD B; for example, for the string "hi" I have to check the matches against all strings in RDD B, which means I need generate
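What this message describes, every string in A checked against every string in B, is literally a Cartesian product; a plain-Python sketch shows why it explodes to |A| x |B| comparisons (the substring test stands in for the real fuzzy-match predicate, which the thread does not specify):

```python
from itertools import product

A = ["hi", "bye", "ch"]
B = ["padma", "hihi", "chch", "priya"]

# Every string in A is compared against every string in B:
all_pairs = list(product(A, B))
assert len(all_pairs) == len(A) * len(B)  # 3 * 4 = 12 comparisons

# Substring containment as a stand-in for the real match predicate:
hits = [(a, b) for a, b in all_pairs if a in b]
# → [("hi", "hihi"), ("ch", "chch")]
```

With 30 MB and 8 MB of strings the pair count is enormous, which is why the earlier suggestion to block by a shared key before joining is the usual fix.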

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Jörn Franke
What is the use case of this? A Cartesian product is by definition slow in any system. Why do you need this? How long does your application take now?

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time. Thanks, Padma Ch

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as Parquet, ORC, ...? // maropu

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi, yes, I have joined using a DataFrame join. Now, to save this into HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd) and trying to save it using saveAsTextFile. However, this is also taking too much time. Thanks, Padma Ch

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
Hi, it seems you'd be better off using DataFrame#join instead of RDD.cartesian, because the latter always needs shuffle operations, which have a lot of overheads such as reflection, serialization, ... In your case, since the smaller table is 7 MB, DataFrame#join uses a broadcast strategy. This is a little
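The broadcast strategy mentioned above can be illustrated in plain Python: build an in-memory hash table from the small side, then stream the big side through it in a single pass, with no shuffle (a sketch only; Spark's actual broadcast hash join ships the small table to every executor and works on serialized partitions):

```python
def broadcast_hash_join(big, small):
    """Hash join: build a lookup table from the small side,
    then probe it with one scan of the big side."""
    table = {}
    for key, value in small:           # build phase: small side only
        table.setdefault(key, []).append(value)
    return [(key, bval, sval)          # probe phase: one pass over big
            for key, bval in big
            for sval in table.get(key, [])]

big = [("a", 1), ("b", 2), ("a", 3)]
small = [("a", "x"), ("c", "y")]
joined = broadcast_hash_join(big, small)
# → [("a", 1, "x"), ("a", 3, "x")]
```

This is why the small side's size matters: it must fit in each executor's memory, and a 7 MB table easily does.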

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Mich Talebzadeh
It is basically a Cartesian join, like in an RDBMS. Example: SELECT * FROM FinancialCodes, FinancialData. The result of this query matches every row in the FinancialCodes table with every row in the FinancialData table. Each row consists of all columns from the FinancialCodes table followed by all

Cartesian join on RDDs taking too much time

2016-05-25 Thread Priya Ch
Hi all, I have two RDDs A and B, where A is 30 MB and B is 7 MB. A.cartesian(B) is taking too much time. Is there any bottleneck in the cartesian operation? I am using Spark version 1.6.0. Regards, Padma Ch