Hi,

It seems you'd be better off using DataFrame#join instead of RDD.cartesian: cartesian always needs shuffle operations, which carry a lot of overhead (reflection, serialization, and so on). In your case, since the smaller table is 7 MB, DataFrame#join can use a broadcast strategy, which is considerably more efficient than RDD.cartesian.
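To illustrate the difference (a minimal plain-Python sketch, not actual Spark code; the table names and sizes are stand-ins for FinancialData and FinancialCodes): a cartesian product pairs every row with every row, while a broadcast-style join ships the small side to each worker as a hash map and probes it locally, with no shuffle.

```python
# Stand-in data: a "large" side and a "small" side keyed 0..9.
financial_data = [(i % 10, "row%d" % i) for i in range(1000)]  # large side
financial_codes = [(k, "code%d" % k) for k in range(10)]        # small side

# Cartesian product: every row pairs with every row -> |A| * |B| pairs.
cartesian = [(a, b) for a in financial_data for b in financial_codes]

# Broadcast-style join: build a hash map of the small side once
# (conceptually "broadcasting" it), then probe it per large-side row.
codes_by_key = {}
for key, code in financial_codes:
    codes_by_key.setdefault(key, []).append(code)

joined = [(row, code)
          for key, row in financial_data
          for code in codes_by_key.get(key, [])]

print(len(cartesian))  # 1000 * 10 = 10000 pairs
print(len(joined))     # 1000: each data row matches exactly one code
```

The point is the output size and data movement: the cartesian result grows as |A| × |B| regardless of keys, whereas the broadcast join only materializes matching pairs and never shuffles the large side.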
// maropu

On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> It is basically a Cartesian join, as in an RDBMS.
>
> Example:
>
> SELECT * FROM FinancialCodes, FinancialData
>
> The result of this query matches every row in the FinancialCodes table
> with every row in the FinancialData table. Each row consists of all
> columns from the FinancialCodes table followed by all columns from the
> FinancialData table.
>
> Not very useful
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>> Hi All,
>>
>> I have two RDDs A and B, where A is of size 30 MB and B is of size 7 MB.
>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>> cartesian operation?
>>
>> I am using Spark version 1.6.0.
>>
>> Regards,
>> Padma Ch

--
---
Takeshi Yamamuro