Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
Thanks Daniel.
Do you have any code fragments that use cogroup or join across two RDDs?
I don't think an index would help much, because this is an N x M
operation that compares every element of one RDD against every element of
the other. Each comparison is complex, as it needs to peer into a complex
JSON structure.
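
For the N x M case described here, one option is the RDD `cartesian` operator, which produces every (a, b) pair so that an arbitrary predicate can be applied to each. The following is only a rough sketch in PySpark: `matches`, `matches_per_a`, `run_example`, and the "color" field are hypothetical stand-ins for the real JSON comparison, not anything from the thread.

```python
import json


def matches(a, b):
    """Hypothetical predicate standing in for the real, complex JSON comparison."""
    a_doc, _a_num = a
    b_doc, _b_num = b
    return json.loads(a_doc).get("color") == json.loads(b_doc).get("color")


def matches_per_a(rdd_a, rdd_b):
    """For each element of rdd_a, collect the elements of rdd_b that match it."""
    return (
        rdd_a.cartesian(rdd_b)  # every (a, b) pair: the N x M step
        .filter(lambda pair: matches(pair[0], pair[1]))
        .groupByKey()  # a -> iterable of matching b's
        .mapValues(list)
    )


def run_example():
    # Imported here so the helpers above can be read without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "nested-loop-sketch")
    try:
        a = sc.parallelize([('{"color": "red"}', 1), ('{"color": "blue"}', 2)])
        b = sc.parallelize([('{"color": "red"}', 10), ('{"color": "green"}', 20)])
        return matches_per_a(a, b).collect()
    finally:
        sc.stop()
```

Note that elements of A with no match drop out of the result, and that `cartesian` materializes N x M pairs across the cluster, so it is only reasonable when that product is affordable.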


On Mon, Aug 15, 2016 at 1:24 PM, Daniel Imberman wrote:

> There's no real way of doing nested for-loops with RDDs, because the whole
> idea is that an RDD could hold so much data that it would be impractical
> to store it all on one worker.
>
> There are, however, ways to handle what you're asking about.
>
> I would personally use something like cogroup or join between the two
> RDDs. If index matters, you can use zipWithIndex on both before you join
> and then see which indexes match up.
>
> On Mon, Aug 15, 2016 at 1:15 PM Eric Ho wrote:
>
>> I've nested foreach loops like this:
>>
>>   for each index i of A do:
>>     for each index j of B do:
>>       append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>>
>> Each element in A or B is some complex structure like:
>> (
>>   some complex JSON,
>>   some number
>> )
>>
>> Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)),
>> how would my code look?
>> Are there any RDD operators that would allow me to loop through both RDDs
>> like the above procedural code?
>> I can't find any RDD operators or code fragments that would allow me
>> to do this.
>>
>> Thing is: by the time I composed RDD(A), this RDD would contain
>> elements of array B as well as array A.
>> Same argument for RDD(B).
>>
>> Any pointers much appreciated.
>>
>> Thanks.
>>
>>
>> --
>>
>> -eric ho
>>
>>


-- 

-eric ho


Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Daniel Imberman
There's no real way of doing nested for-loops with RDDs, because the whole
idea is that an RDD could hold so much data that it would be impractical
to store it all on one worker.

There are, however, ways to handle what you're asking about.

I would personally use something like cogroup or join between the two RDDs.
If index matters, you can use zipWithIndex on both before you join and then
see which indexes match up.
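
As a rough illustration of the suggestion above, here is a minimal PySpark sketch of joining or cogrouping two RDDs on a key, and of pairing elements by position via zipWithIndex. It assumes the match can be reduced to equality on some key; `color_key`, `join_on_key`, `cogroup_on_key`, `join_on_position`, and the "color" field are hypothetical names chosen for the example.

```python
import json


def color_key(element):
    """Hypothetical key extractor: pulls one field out of the JSON part."""
    doc, _num = element
    return json.loads(doc).get("color")


def join_on_key(rdd_a, rdd_b, key_of):
    """Pair elements that share a key: yields (key, (a, b)) for each match."""
    return rdd_a.keyBy(key_of).join(rdd_b.keyBy(key_of))


def cogroup_on_key(rdd_a, rdd_b, key_of):
    """Group both sides per key: yields (key, ([a, ...], [b, ...]))."""
    return (
        rdd_a.keyBy(key_of)
        .cogroup(rdd_b.keyBy(key_of))
        .mapValues(lambda groups: (list(groups[0]), list(groups[1])))
    )


def _index_first(pair):
    # zipWithIndex yields (element, index); join needs (index, element).
    element, index = pair
    return (index, element)


def join_on_position(rdd_a, rdd_b):
    """Pair the i-th element of rdd_a with the i-th element of rdd_b."""
    return (
        rdd_a.zipWithIndex().map(_index_first)
        .join(rdd_b.zipWithIndex().map(_index_first))
    )
```

With a live SparkContext `sc`, a call such as `join_on_key(sc.parallelize(a_list), sc.parallelize(b_list), color_key).collect()` would return the matched pairs. If the comparison cannot be expressed as key equality, a join alone will not cover it.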

On Mon, Aug 15, 2016 at 1:15 PM Eric Ho wrote:

> I've nested foreach loops like this:
>
>   for each index i of A do:
>     for each index j of B do:
>       append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>
> Each element in A or B is some complex structure like:
> (
>   some complex JSON,
>   some number
> )
>
> Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)),
> how would my code look?
> Are there any RDD operators that would allow me to loop through both RDDs
> like the above procedural code?
> I can't find any RDD operators or code fragments that would allow me
> to do this.
>
> Thing is: by the time I composed RDD(A), this RDD would contain
> elements of array B as well as array A.
> Same argument for RDD(B).
>
> Any pointers much appreciated.
>
> Thanks.
>
>
> --
>
> -eric ho
>
>