Re: in joins, does one side stream?

Reynold Xin Sun, 20 Sep 2015 11:51:32 -0700

We do - but I don't think it is feasible to duplicate every single
algorithm in DF and in RDD.


The only way for this to work is to make one underlying implementation work
for both. Right now DataFrame knows how to serialize individual elements
well and can manage memory that way -- the RDD API doesn't give us enough
information for that.

https://issues.apache.org/jira/browse/SPARK-9999




On Sun, Sep 20, 2015 at 11:48 AM, Koert Kuipers <ko...@tresata.com> wrote:

> sorry that was a typo. i meant to say:
>
> why do we have these features (broadcast join and sort-merge join) in
> DataFrame but not in RDD?
>
> they don't seem specific to structured data analysis to me.
>
> thanks! koert
>
> On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> why dont we want these (broadcast join and sort-merge join) in DataFrame
>> but not in RDD?
>>
>> they dont seem specific to structured data analysis to me.
>>
>> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <
>> rishi80.mis...@gmail.com> wrote:
>>
>>> Got it..thnx Reynold..
>>> On 20 Sep 2015 07:08, "Reynold Xin" <r...@databricks.com> wrote:
>>>
>>>> The RDDs themselves are not materialized, but the implementations can
>>>> materialize.
>>>>
>>>> E.g. in cogroup (which is used by RDD.join), it materializes all the
>>>> data during grouping.
>>>>
>>>> In SQL/DataFrame join, depending on the join:
>>>>
>>>> 1. For broadcast join, only the smaller side is materialized in memory
>>>> as a hash table.
>>>>
>>>> 2. For sort-merge join, both sides are sorted & streamed through --
>>>> however, one of the sides need to buffer all the rows having the same join
>>>> key in order to perform the join.
>>>>
>>>>
>>>>
>>>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
>>>> rishi80.mis...@gmail.com> wrote:
>>>>
>>>>> Hi Reynold,
>>>>> Can you please elaborate on this. I thought RDD also opens only an
>>>>> iterator. Does it get materialized for joins?
>>>>>
>>>>> Rishi
>>>>>
>>>>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>>>>>> streams.
>>>>>>
>>>>>>
>>>>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> in scalding we join with the smaller side on the left, since the
>>>>>>> smaller side will get buffered while the bigger side streams through the
>>>>>>> join.
>>>>>>>
>>>>>>> looking at CoGroupedRDD i do not get the impression such a
>>>>>>> distiction is made. it seems both sided are put into a map that can 
>>>>>>> spill
>>>>>>> to disk. is this correct?
>>>>>>>
>>>>>>> thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>
>

Re: in joins, does one side stream?

Reply via email to