Hi RDD solution: >>> u = [(615,1),(720,1),(615,2)] >>> urdd=sc.parallelize(u,1) >>> a1 = [(615,'T'),(720,'C')] >>> ardd=sc.parallelize(a1,1) >>> def addString(s1,s2): ... return s1+','+s2 >>> j = urdd.join(ardd).map(lambda t:t[1]).reduceByKey(addString) >>> print j.collect() [(2, 'T'), (1, 'C,T')]
However, if you can assume <number of users> is far far greater than <number of distinct area codes>, you may think to broadcast variable in a dict format and look up in the map. Like this >>> u = [(1,615),(1,720),(2,615)] >>> a = {615:'T',720:'C'} >>> urdd=sc.parallelize(u) >>> def usr_area_state(tup): ... uid=tup[0] ... aid=tup[1] ... sid=bc.value[aid] ... return uid,(sid,) ... >>> bc=sc.broadcast(a) >>> usrdd=urdd.map(usr_area_state) >>> def addTuple(t1,t2): ... return t1+t2 ... >>> out=usrdd.reduceByKey(addTuple) >>> print out.collect() [(1, ('T', 'C')), (2, ('T',))] Best Ayan On Wed, Jul 22, 2015 at 5:14 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote: > If I understand this correctly, you could join area_code_user and > area_code_state and then flat map to get > user, areacode, state. Then groupby/reduce by user. > > You can also try some join optimizations like partitioning on area code or > broadcasting smaller table depending on size of area_code_state. > On Jul 22, 2015 10:15 AM, "John Berryman" <jo...@eventbrite.com> wrote: > >> Quick example problem that's stumping me: >> >> * Users have 1 or more phone numbers and therefore one or more area codes. >> * There are 100M users. >> * States have one or more area codes. >> * I would like to the states for the users (as indicated by phone area >> code). >> >> I was thinking about something like this: >> >> If area_code_user looks like (area_code,[user_id]) ex: (615,[1234567]) >> and area_code_state looks like (area_code,state) ex: (615, ["Tennessee"]) >> then we could do >> >> states_and_users_mixed = area_code_user.join(area_code_state) \ >> .reduceByKey(lambda a,b: a+b) \ >> .values() >> >> user_state_pairs = states_and_users_mixed.flatMap( >> emit_cartesian_prod_of_userids_and_states ) >> user_to_states = user_state_pairs.reduceByKey(lambda a,b: a+b) >> >> user_to_states.first(1) >> >> >>> (1234567,["Tennessee","Tennessee","California"]) >> >> This would work, but the user_state_pairs is just a list of user_ids and >> state names mixed together and emit_cartesian_prod_of_userids_and_states >> has to correctly pair them. This is problematic because 1) it's weird and >> sloppy and 2) there will be lots of users per state and having so many >> users in a single row is going to make >> emit_cartesian_prod_of_userids_and_states work extra hard to first locate >> states and then emit all userid-state pairs. >> >> How should I be doing this? >> >> Thanks, >> -John >> > -- Best Regards, Ayan Guha