Re: many-to-many join

ayan guha Wed, 22 Jul 2015 02:41:39 -0700

Hi

RDD solution:
>>> u = [(615,1),(720,1),(615,2)]
>>> urdd=sc.parallelize(u,1)
>>> a1 = [(615,'T'),(720,'C')]
>>> ardd=sc.parallelize(a1,1)
>>> def addString(s1,s2):
...     return s1+','+s2
>>> j = urdd.join(ardd).map(lambda t:t[1]).reduceByKey(addString)
>>> print j.collect()
[(2, 'T'), (1, 'C,T')]


However, if you can assume <number of users>  is far far greater than
<number of distinct area codes>, you may think to broadcast variable in a
dict format and look up in the map. Like this

>>> u = [(1,615),(1,720),(2,615)]
>>> a = {615:'T',720:'C'}
>>> urdd=sc.parallelize(u)
>>> def usr_area_state(tup):
...     uid=tup[0]
...     aid=tup[1]
...     sid=bc.value[aid]
...     return uid,(sid,)
...
>>> bc=sc.broadcast(a)
>>> usrdd=urdd.map(usr_area_state)
>>> def addTuple(t1,t2):
...     return t1+t2
...
>>> out=usrdd.reduceByKey(addTuple)
>>> print out.collect()
[(1, ('T', 'C')), (2, ('T',))]

Best
Ayan

On Wed, Jul 22, 2015 at 5:14 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> If I understand this correctly, you could join area_code_user and
> area_code_state and then flat map to get
> user, areacode, state. Then groupby/reduce by user.
>
> You can also try some join optimizations like partitioning on area code or
> broadcasting smaller table depending on size of area_code_state.
> On Jul 22, 2015 10:15 AM, "John Berryman" <jo...@eventbrite.com> wrote:
>
>> Quick example problem that's stumping me:
>>
>> * Users have 1 or more phone numbers and therefore one or more area codes.
>> * There are 100M users.
>> * States have one or more area codes.
>> * I would like to the states for the users (as indicated by phone area
>> code).
>>
>> I was thinking about something like this:
>>
>> If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567])
>> and area_code_state looks like (area_code,state) ex: (615, ["Tennessee"])
>> then we could do
>>
>> states_and_users_mixed = area_code_user.join(area_code_state) \
>>     .reduceByKey(lambda a,b: a+b) \
>>     .values()
>>
>> user_state_pairs = states_and_users_mixed.flatMap(
>>         emit_cartesian_prod_of_userids_and_states )
>> user_to_states =   user_state_pairs.reduceByKey(lambda a,b: a+b)
>>
>> user_to_states.first(1)
>>
>> >>> (1234567,["Tennessee","Tennessee","California"])
>>
>> This would work, but the user_state_pairs is just a list of user_ids and
>> state names mixed together and emit_cartesian_prod_of_userids_and_states
>> has to correctly pair them. This is problematic because 1) it's weird and
>> sloppy and 2) there will be lots of users per state and having so many
>> users in a single row is going to make
>> emit_cartesian_prod_of_userids_and_states work extra hard to first locate
>> states and then emit all userid-state pairs.
>>
>> How should I be doing this?
>>
>> Thanks,
>> -John
>>
>


-- 
Best Regards,
Ayan Guha

Re: many-to-many join

Reply via email to