sorry that was a typo. i meant to say:

why do we have these features (broadcast join and sort-merge join) in
DataFrame but not in RDD?

they don't seem specific to structured data analysis to me.

thanks! koert

On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers <ko...@tresata.com> wrote:

> why dont we want these (broadcast join and sort-merge join) in DataFrame
> but not in RDD?
>
> they dont seem specific to structured data analysis to me.
>
> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <
> rishi80.mis...@gmail.com> wrote:
>
>> Got it..thnx Reynold..
>> On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote:
>>
>>> The RDDs themselves are not materialized, but the implementations can
>>> materialize.
>>>
>>> E.g. in cogroup (which is used by RDD.join), it materializes all the
>>> data during grouping.
>>>
>>> In SQL/DataFrame join, depending on the join:
>>>
>>> 1. For broadcast join, only the smaller side is materialized in memory
>>> as a hash table.
>>>
>>> 2. For sort-merge join, both sides are sorted & streamed through --
>>> however, one of the sides need to buffer all the rows having the same join
>>> key in order to perform the join.
>>>
>>>
>>>
>>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
>>> rishi80.mis...@gmail.com> wrote:
>>>
>>>> Hi Reynold,
>>>> Can you please elaborate on this. I thought RDD also opens only an
>>>> iterator. Does it get materialized for joins?
>>>>
>>>> Rishi
>>>>
>>>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>>>>> streams.
>>>>>
>>>>>
>>>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> in scalding we join with the smaller side on the left, since the
>>>>>> smaller side will get buffered while the bigger side streams through the
>>>>>> join.
>>>>>>
>>>>>> looking at CoGroupedRDD i do not get the impression such a distiction
>>>>>> is made. it seems both sided are put into a map that can spill to disk. 
>>>>>> is
>>>>>> this correct?
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>
>>>>>
>>>
>

Reply via email to