Re: Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Pedro Rodriguez
Hi Jestin, Spark is smart about how it does joins. In this case, if df2 is sufficiently small, it will do a broadcast join. Basically, rather than shuffling df1/df2 for a join, it broadcasts df2 to all workers and joins locally. Looks like you may already have known that though based on using the
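A minimal sketch of the broadcast join Pedro describes, in Scala. The DataFrame names df1/df2 follow the thread; the join column name `key` and the input paths are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Placeholder paths: df1 is many GB, df2 is ~250 MB.
val df1 = spark.read.parquet("/path/to/df1")
val df2 = spark.read.parquet("/path/to/df2")

// Spark broadcasts the small side automatically only when it is below
// spark.sql.autoBroadcastJoinThreshold (default 10 MB), so at 250 MB you
// would either raise that threshold or hint the broadcast explicitly:
val joined = df1.join(broadcast(df2), "key")
```

Note that the explicit `broadcast()` hint bypasses the size threshold entirely, which is the usual approach when the small side is larger than the default 10 MB cutoff but still comfortably fits in each executor's memory.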

Choosing RDD/DataFrame/DataSet and Cluster Tuning

2016-07-23 Thread Jestin Ma
Hello, Right now I'm using DataFrames to perform a df1.groupBy(key).count() on one DataFrame and join with another, df2. The first, df1, is very large (many gigabytes) compared to df2 (250 MB). Right now I'm running this on a cluster of 5 nodes, 16 cores each, 90 GB RAM each. It is taking me
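The aggregate-then-join pattern Jestin describes can be sketched as follows (Scala; the column name `key` and the input paths are assumptions, not from the thread):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-join-sketch").getOrCreate()

// Placeholder inputs: df1 is many GB, df2 is ~250 MB.
val df1 = spark.read.parquet("/path/to/df1")
val df2 = spark.read.parquet("/path/to/df2")

// The groupBy triggers a shuffle of df1 by key; the subsequent join is
// where a broadcast of the small df2 side can avoid a second shuffle.
val counts = df1.groupBy("key").count()
val result = counts.join(df2, "key")
```

On a 5-node cluster with 16 cores and 90 GB RAM per node, the shuffle in `groupBy` is typically the expensive step, so settings such as `spark.sql.shuffle.partitions` and executor sizing tend to matter more than the join strategy.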