How about this: map the input to (key, value) pairs, then use reduceByKey with a max operation to get the maximum value per key. You can then join the resulting RDD with your lookup data and reduce (if you only want to look up two values, you can use lookup directly as well). PS: these are the operations as named in the Scala API; I am not sure how far the PySpark API has caught up with them.
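As a rough illustration of the idea, here is a minimal plain-Python sketch of the same steps, so it runs without a cluster. The helper names `max_per_key` and `pair_sums` are made up for this example and are not Spark functions. In PySpark the first step would presumably be `input.reduceByKey(max)`, and because the per-key result is small it could be brought to the driver with `collectAsMap()` and combined over the key pairs there, rather than passing one RDD into another RDD's `map`. I have not run the Spark version, so treat that mapping as an assumption.

```python
from itertools import product


def max_per_key(pairs):
    """Plain-Python stand-in for rdd.reduceByKey(max): keep the largest value per key."""
    best = {}
    for key, value in pairs:
        # best.get(key, value) falls back to the value itself the first time a key is seen
        best[key] = max(value, best.get(key, value))
    return best


def pair_sums(pairs):
    """Map every ordered key pair (a, b), self-pairs included, to max(a) + max(b)."""
    best = max_per_key(pairs)
    keys = sorted(best)
    # product(keys, repeat=2) plays the role of keys.cartesian(keys)
    return {(a, b): best[a] + best[b] for a, b in product(keys, repeat=2)}


# Sample data from the original question
data = [(1, 100), (1, 200), (2, 100), (2, 300)]
result = pair_sums(data)
for pair in sorted(result):
    print(pair, "->", result[pair])
```

With the sample data this gives 400 for (1,1), 500 for (1,2) and 600 for (2,2), matching the corrected outputs in the question.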

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Tue, Jun 24, 2014 at 3:33 AM, Aaron <aaron.doss...@target.com> wrote:

> Sorry, I got my sample outputs wrong
>
> (1,1) -> 400
> (1,2) -> 500
> (2,2)-> 600
>
> On Jun 23, 2014, at 4:29 PM, "Aaron Dossett [via Apache Spark User List]"
> <[hidden email]> wrote:
>
> I am relatively new to Spark and am getting stuck trying to do the
> following:
>
> - My input is integer key, value pairs where the key is not unique. I'm
> interested in information about all possible distinct key combinations,
> thus the Cartesian product.
> - My first attempt was to create a separate RDD of this cartesian product
> and then use map() to calculate the data. However, I was trying to pass
> another RDD to the function map was calling, which I eventually figured out
> was causing a run time error, even if the function I called with map did
> nothing. Here's a simple code example:
>
> -------
> def somefunc(x, y, RDD):
>     return 0
>
> input = sc.parallelize([(1,100), (1,200), (2, 100), (2,300)])
>
> #Create all pairs of keys, including self-pairs
> itemPairs = input.map(lambda x: x[0]).distinct()
> itemPairs = itemPairs.cartesian(itemPairs)
>
> print itemPairs.collect()
>
> TC = itemPairs.map(lambda x: (x, somefunc(x[0], x[1], input)))
>
> print TC.collect()
> ------
>
> I'm assuming this isn't working because it isn't a very Spark-like way to
> do things and I could imagine that passing RDDs into other RDD's map
> functions might not make sense. Could someone suggest to me a way to apply
> transformations and actions to "input" that would produce a mapping of key
> pairs to some information related to the values.
>
> For example, I might want to (1, 2) to map to the sum of the maximum
> values found for each key in the input (500 in my sample data above).
> Extending that example (1,1) would map to 300 and (2,2) to 400.
>
> Please let me know if I should provide more details or a more robust
> example.
>
> Thank you, Aaron
>
> ------------------------------
> View this message in context: Re: Efficiently doing an analysis with
> Cartesian product (pyspark)
> <http://apache-spark-user-list.1001560.n3.nabble.com/Efficiently-doing-an-analysis-with-Cartesian-product-pyspark-tp8144p8145.html>