Re: in joins, does one side stream?

2015-09-20 Thread Rishitesh Mishra
Got it..thnx Reynold..
On 20 Sep 2015 07:08, "Reynold Xin"  wrote:

> The RDDs themselves are not materialized, but the implementations can
> materialize.
>
> E.g. in cogroup (which is used by RDD.join), it materializes all the data
> during grouping.
>
> In SQL/DataFrame join, depending on the join:
>
> 1. For broadcast join, only the smaller side is materialized in memory as
> a hash table.
>
> 2. For sort-merge join, both sides are sorted & streamed through --
> however, one of the sides need to buffer all the rows having the same join
> key in order to perform the join.
>
>
>
> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
> rishi80.mis...@gmail.com> wrote:
>
>> Hi Reynold,
>> Can you please elaborate on this. I thought RDD also opens only an
>> iterator. Does it get materialized for joins?
>>
>> Rishi
>>
>> On Saturday, September 19, 2015, Reynold Xin  wrote:
>>
>>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>>> streams.
>>>
>>>
>>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers 
>>> wrote:
>>>
 in scalding we join with the smaller side on the left, since the
 smaller side will get buffered while the bigger side streams through the
 join.

 looking at CoGroupedRDD i do not get the impression such a distiction
 is made. it seems both sided are put into a map that can spill to disk. is
 this correct?

 thanks

>>>
>>>
>


Re: in joins, does one side stream?

2015-09-20 Thread Reynold Xin
We do - but I don't think it is feasible to duplicate every single
algorithm in DF and in RDD.

The only way for this to work is to make one underlying implementation work
for both. Right now DataFrame knows how to serialize individual elements
well and can manage memory that way -- the RDD API doesn't give us enough
information for that.

https://issues.apache.org/jira/browse/SPARK-




On Sun, Sep 20, 2015 at 11:48 AM, Koert Kuipers  wrote:

> sorry that was a typo. i meant to say:
>
> why do we have these features (broadcast join and sort-merge join) in
> DataFrame but not in RDD?
>
> they don't seem specific to structured data analysis to me.
>
> thanks! koert
>
> On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers  wrote:
>
>> why dont we want these (broadcast join and sort-merge join) in DataFrame
>> but not in RDD?
>>
>> they dont seem specific to structured data analysis to me.
>>
>> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <
>> rishi80.mis...@gmail.com> wrote:
>>
>>> Got it..thnx Reynold..
>>> On 20 Sep 2015 07:08, "Reynold Xin"  wrote:
>>>
 The RDDs themselves are not materialized, but the implementations can
 materialize.

 E.g. in cogroup (which is used by RDD.join), it materializes all the
 data during grouping.

 In SQL/DataFrame join, depending on the join:

 1. For broadcast join, only the smaller side is materialized in memory
 as a hash table.

 2. For sort-merge join, both sides are sorted & streamed through --
 however, one of the sides need to buffer all the rows having the same join
 key in order to perform the join.



 On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
 rishi80.mis...@gmail.com> wrote:

> Hi Reynold,
> Can you please elaborate on this. I thought RDD also opens only an
> iterator. Does it get materialized for joins?
>
> Rishi
>
> On Saturday, September 19, 2015, Reynold Xin 
> wrote:
>
>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>> streams.
>>
>>
>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers 
>> wrote:
>>
>>> in scalding we join with the smaller side on the left, since the
>>> smaller side will get buffered while the bigger side streams through the
>>> join.
>>>
>>> looking at CoGroupedRDD i do not get the impression such a
>>> distiction is made. it seems both sided are put into a map that can 
>>> spill
>>> to disk. is this correct?
>>>
>>> thanks
>>>
>>
>>

>>
>


Re: in joins, does one side stream?

2015-09-20 Thread Koert Kuipers
why dont we want these (broadcast join and sort-merge join) in DataFrame
but not in RDD?

they dont seem specific to structured data analysis to me.

On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra 
wrote:

> Got it..thnx Reynold..
> On 20 Sep 2015 07:08, "Reynold Xin"  wrote:
>
>> The RDDs themselves are not materialized, but the implementations can
>> materialize.
>>
>> E.g. in cogroup (which is used by RDD.join), it materializes all the data
>> during grouping.
>>
>> In SQL/DataFrame join, depending on the join:
>>
>> 1. For broadcast join, only the smaller side is materialized in memory as
>> a hash table.
>>
>> 2. For sort-merge join, both sides are sorted & streamed through --
>> however, one of the sides need to buffer all the rows having the same join
>> key in order to perform the join.
>>
>>
>>
>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
>> rishi80.mis...@gmail.com> wrote:
>>
>>> Hi Reynold,
>>> Can you please elaborate on this. I thought RDD also opens only an
>>> iterator. Does it get materialized for joins?
>>>
>>> Rishi
>>>
>>> On Saturday, September 19, 2015, Reynold Xin 
>>> wrote:
>>>
 Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
 streams.


 On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers 
 wrote:

> in scalding we join with the smaller side on the left, since the
> smaller side will get buffered while the bigger side streams through the
> join.
>
> looking at CoGroupedRDD i do not get the impression such a distiction
> is made. it seems both sided are put into a map that can spill to disk. is
> this correct?
>
> thanks
>


>>


Re: in joins, does one side stream?

2015-09-20 Thread Koert Kuipers
sorry that was a typo. i meant to say:

why do we have these features (broadcast join and sort-merge join) in
DataFrame but not in RDD?

they don't seem specific to structured data analysis to me.

thanks! koert

On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers  wrote:

> why dont we want these (broadcast join and sort-merge join) in DataFrame
> but not in RDD?
>
> they dont seem specific to structured data analysis to me.
>
> On Sun, Sep 20, 2015 at 2:41 AM, Rishitesh Mishra <
> rishi80.mis...@gmail.com> wrote:
>
>> Got it..thnx Reynold..
>> On 20 Sep 2015 07:08, "Reynold Xin"  wrote:
>>
>>> The RDDs themselves are not materialized, but the implementations can
>>> materialize.
>>>
>>> E.g. in cogroup (which is used by RDD.join), it materializes all the
>>> data during grouping.
>>>
>>> In SQL/DataFrame join, depending on the join:
>>>
>>> 1. For broadcast join, only the smaller side is materialized in memory
>>> as a hash table.
>>>
>>> 2. For sort-merge join, both sides are sorted & streamed through --
>>> however, one of the sides need to buffer all the rows having the same join
>>> key in order to perform the join.
>>>
>>>
>>>
>>> On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra <
>>> rishi80.mis...@gmail.com> wrote:
>>>
 Hi Reynold,
 Can you please elaborate on this. I thought RDD also opens only an
 iterator. Does it get materialized for joins?

 Rishi

 On Saturday, September 19, 2015, Reynold Xin 
 wrote:

> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
> streams.
>
>
> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers 
> wrote:
>
>> in scalding we join with the smaller side on the left, since the
>> smaller side will get buffered while the bigger side streams through the
>> join.
>>
>> looking at CoGroupedRDD i do not get the impression such a distiction
>> is made. it seems both sided are put into a map that can spill to disk. 
>> is
>> this correct?
>>
>> thanks
>>
>
>
>>>
>


Re: in joins, does one side stream?

2015-09-19 Thread Rishitesh Mishra
Hi Reynold,
Can you please elaborate on this. I thought RDD also opens only an
iterator. Does it get materialized for joins?

Rishi

On Saturday, September 19, 2015, Reynold Xin  wrote:

> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
> streams.
>
>
> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers  > wrote:
>
>> in scalding we join with the smaller side on the left, since the smaller
>> side will get buffered while the bigger side streams through the join.
>>
>> looking at CoGroupedRDD i do not get the impression such a distiction is
>> made. it seems both sided are put into a map that can spill to disk. is
>> this correct?
>>
>> thanks
>>
>
>


Re: in joins, does one side stream?

2015-09-19 Thread Reynold Xin
The RDDs themselves are not materialized, but the implementations can
materialize.

E.g. in cogroup (which is used by RDD.join), it materializes all the data
during grouping.

In SQL/DataFrame join, depending on the join:

1. For broadcast join, only the smaller side is materialized in memory as a
hash table.

2. For sort-merge join, both sides are sorted & streamed through --
however, one of the sides need to buffer all the rows having the same join
key in order to perform the join.



On Sat, Sep 19, 2015 at 12:55 PM, Rishitesh Mishra  wrote:

> Hi Reynold,
> Can you please elaborate on this. I thought RDD also opens only an
> iterator. Does it get materialized for joins?
>
> Rishi
>
> On Saturday, September 19, 2015, Reynold Xin  wrote:
>
>> Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
>> streams.
>>
>>
>> On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers 
>> wrote:
>>
>>> in scalding we join with the smaller side on the left, since the smaller
>>> side will get buffered while the bigger side streams through the join.
>>>
>>> looking at CoGroupedRDD i do not get the impression such a distiction is
>>> made. it seems both sided are put into a map that can spill to disk. is
>>> this correct?
>>>
>>> thanks
>>>
>>
>>


Re: in joins, does one side stream?

2015-09-18 Thread Reynold Xin
Yes for RDD -- both are materialized. No for DataFrame/SQL - one side
streams.


On Thu, Sep 17, 2015 at 11:21 AM, Koert Kuipers  wrote:

> in scalding we join with the smaller side on the left, since the smaller
> side will get buffered while the bigger side streams through the join.
>
> looking at CoGroupedRDD i do not get the impression such a distiction is
> made. it seems both sided are put into a map that can spill to disk. is
> this correct?
>
> thanks
>