How about this:
Map the input to (key, value) pairs, then use reduceByKey with a max operation
to get one maximum per key.
Then you can join the resulting RDD with your lookup data and reduce (if you
only want to look up two values, you can use lookup directly as well).
PS: these are operations from the Scala API; I am not aware how far the PySpark
API has come in supporting them.
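
A rough PySpark sketch of the steps above, in case it helps. It assumes
reduceByKey and collectAsMap are available on RDDs in your PySpark version,
and that the number of distinct keys is small enough to combine on the driver.
The helper names (max_per_key, pair_sums, run_example) are made up for
illustration, not part of any API.

```python
from itertools import combinations_with_replacement


def max_per_key(rdd):
    """Collapse (key, value) pairs to one (key, max value) pair per key."""
    return rdd.reduceByKey(max)


def pair_sums(max_by_key):
    """Driver-side step: sum the per-key maxima for every unordered key pair.

    max_by_key is a plain dict such as the one returned by collectAsMap().
    Self-pairs like (1, 1) are included.
    """
    keys = sorted(max_by_key)
    return {
        (a, b): max_by_key[a] + max_by_key[b]
        for a, b in combinations_with_replacement(keys, 2)
    }


def run_example():
    # Imported lazily so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    # "local[1]" and the app name are assumptions for a local test run.
    sc = SparkContext("local[1]", "pair-max-sums")
    try:
        data = sc.parallelize([(1, 100), (1, 200), (2, 100), (2, 300)])
        # No RDD is passed into another RDD's function here: the reduced
        # result is collected to the driver first, then combined locally.
        return pair_sums(max_per_key(data).collectAsMap())
    finally:
        sc.stop()
```

If the set of keys is too large to collect, one alternative might be to call
cartesian on the reduced RDD itself and map each ((k1, v1), (k2, v2)) element
to ((k1, k2), v1 + v2), which keeps everything distributed.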

Mayur Rustagi
Ph: +1 (760) 203 3257
@mayur_rustagi

On Tue, Jun 24, 2014 at 3:33 AM, Aaron wrote:

> Sorry, I got my sample outputs wrong. They should be:
> (1,1) -> 400
> (1,2) -> 500
> (2,2) -> 600
> On Jun 23, 2014, at 4:29 PM, "Aaron Dossett [via Apache Spark User List]" wrote:
>  I am relatively new to Spark and am getting stuck trying to do the
> following:
> - My input is integer key, value pairs where the key is not unique.  I'm
> interested in information about all possible distinct key combinations,
> thus the Cartesian product.
> - My first attempt was to create a separate RDD of this cartesian product
> and then use map() to calculate the data.  However, I was trying to pass
> another RDD to the function map was calling, which I eventually figured out
> was causing a run time error, even if the function I called with map did
> nothing.  Here's a simple code example:
> -------
> def somefunc(x, y, RDD):
>   return 0
> input = sc.parallelize([(1,100), (1,200), (2, 100), (2,300)])
> #Create all pairs of keys, including self-pairs
> itemPairs = input.map(lambda x: x[0]).distinct()
> itemPairs = itemPairs.cartesian(itemPairs)
> print itemPairs.collect()
> TC = itemPairs.map(lambda x: (x, somefunc(x[0], x[1], input)))
> print TC.collect()
> ------
> I'm assuming this isn't working because it isn't a very Spark-like way to
> do things and I could imagine that passing RDDs into other RDD's map
> functions might not make sense.  Could someone suggest to me a way to apply
> transformations and actions to "input" that would produce a mapping of key
> pairs to some information related to the values?
> For example, I might want (1, 2) to map to the sum of the maximum
> values found for each key in the input (500 in my sample data above).
> Extending that example, (1,1) would map to 300 and (2,2) to 400.
> Please let me know if I should provide more details or a more robust
> example.
> Thank you, Aaron
> View this message in context: Re: Efficiently doing an analysis with
> Cartesian product (pyspark)
