Got it..thnx Reynold.. On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote:
> The RDDs themselves are not materialized, but the implementations can > materialize. > > E.g. in cogroup (which is used by RDD.join), it materializes all the data > during grouping. > > In SQL/DataFrame join, depending on the join: > > 1. For broadcast join, only the smaller side is materialized in memory as > a hash table. > > 2. For sort-merge join, both sides are sorted & streamed through -- > however, one of the sides need to buffer all the rows having the same join > key in order to perform the join. > > > > On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra < > rishi80.mis...@gmail.com> wrote: > >> Hi Reynold, >> Can you please elaborate on this. I thought RDD also opens only an >> iterator. Does it get materialized for joins? >> >> Rishi >> >> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> wrote: >> >>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side >>> streams. >>> >>> >>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com> >>> wrote: >>> >>>> in scalding we join with the smaller side on the left, since the >>>> smaller side will get buffered while the bigger side streams through the >>>> join. >>>> >>>> looking at CoGroupedRDD i do not get the impression such a distiction >>>> is made. it seems both sided are put into a map that can spill to disk. is >>>> this correct? >>>> >>>> thanks >>>> >>> >>> >