sorry that was a typo. i meant to say: why do we have these features (broadcast join and sort-merge join) in DataFrame but not in RDD?
they don't seem specific to structured data analysis to me. thanks! koert On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers <ko...@tresata.com> wrote: > why dont we want these (broadcast join and sort-merge join) in DataFrame > but not in RDD? > > they dont seem specific to structured data analysis to me. > > On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra < > rishi80.mis...@gmail.com> wrote: > >> Got it..thnx Reynold.. >> On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote: >> >>> The RDDs themselves are not materialized, but the implementations can >>> materialize. >>> >>> E.g. in cogroup (which is used by RDD.join), it materializes all the >>> data during grouping. >>> >>> In SQL/DataFrame join, depending on the join: >>> >>> 1. For broadcast join, only the smaller side is materialized in memory >>> as a hash table. >>> >>> 2. For sort-merge join, both sides are sorted & streamed through -- >>> however, one of the sides need to buffer all the rows having the same join >>> key in order to perform the join. >>> >>> >>> >>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra < >>> rishi80.mis...@gmail.com> wrote: >>> >>>> Hi Reynold, >>>> Can you please elaborate on this. I thought RDD also opens only an >>>> iterator. Does it get materialized for joins? >>>> >>>> Rishi >>>> >>>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> >>>> wrote: >>>> >>>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side >>>>> streams. >>>>> >>>>> >>>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com> >>>>> wrote: >>>>> >>>>>> in scalding we join with the smaller side on the left, since the >>>>>> smaller side will get buffered while the bigger side streams through the >>>>>> join. >>>>>> >>>>>> looking at CoGroupedRDD i do not get the impression such a distiction >>>>>> is made. it seems both sided are put into a map that can spill to disk. >>>>>> is >>>>>> this correct? >>>>>> >>>>>> thanks >>>>>> >>>>> >>>>> >>> >