Re: in joins, does one side stream?

Rishitesh Mishra Sat, 19 Sep 2015 23:42:00 -0700

Got it..thnx Reynold..
On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote:


> The RDDs themselves are not materialized, but the implementations can
> materialize.
>
> E.g. in cogroup (which is used by RDD.join), it materializes all the data
> during grouping.
>
> In SQL/DataFrame join, depending on the join:
>
> 1. For broadcast join, only the smaller side is materialized in memory as
> a hash table.
>
> 2. For sort-merge join, both sides are sorted & streamed through --
> however, one of the sides need to buffer all the rows having the same join
> key in order to perform the join.
>
>
>
> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
> rishi80.mis...@gmail.com> wrote:
>
>> Hi Reynold,
>> Can you please elaborate on this. I thought RDD also opens only an
>> iterator. Does it get materialized for joins?
>>
>> Rishi
>>
>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>>> streams.
>>>
>>>
>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> in scalding we join with the smaller side on the left, since the
>>>> smaller side will get buffered while the bigger side streams through the
>>>> join.
>>>>
>>>> looking at CoGroupedRDD i do not get the impression such a distiction
>>>> is made. it seems both sided are put into a map that can spill to disk. is
>>>> this correct?
>>>>
>>>> thanks
>>>>
>>>
>>>
>

Re: in joins, does one side stream?

Reply via email to