Re: Efficiently doing an analysis with Cartesian product (pyspark)

Mayur Rustagi Tue, 24 Jun 2014 15:38:27 -0700

How about this:
Map it to key-value pairs, then use reduceByKey with a max operation.
Then on that RDD you can join with your lookup data and reduce (if you only
want to look up 2 values, you can use lookup directly as well).
PS: these are the operations as they exist in the Scala API; I am not sure
how much of this the pyspark API covers yet.
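
A rough pyspark sketch of the steps above, for illustration only. The
function names (max_per_key, pair_sums, main) and the local master setting
are my own assumptions, not anything from the original code. Instead of a
join against separate lookup data, this version takes the Cartesian product
of the already-reduced RDD with itself, which is small because it holds one
row per distinct key:

```python
def max_per_key(pairs):
    """Reduce an RDD of (key, value) pairs to one (key, max value) per key."""
    return pairs.reduceByKey(max)


def pair_sums(pairs):
    """Map every ordered key pair (k1, k2) to max(k1) + max(k2)."""
    maxima = max_per_key(pairs)
    # cartesian() yields ((k1, m1), (k2, m2)); no RDD is referenced inside
    # the lambda, only the plain tuples it is handed.
    return maxima.cartesian(maxima).map(
        lambda p: ((p[0][0], p[1][0]), p[0][1] + p[1][1])
    )


def main():
    # Imported here so the helpers above stay importable without Spark.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "pair-sums-sketch")  # assumed local setup
    data = sc.parallelize([(1, 100), (1, 200), (2, 100), (2, 300)])
    for pair, total in sorted(pair_sums(data).collect()):
        print(pair, total)
    sc.stop()
```

Calling main() on the sample data should give 400 for (1,1), 500 for (1,2)
and (2,1), and 600 for (2,2). If only a couple of keys are needed,
max_per_key(data).lookup(1) returns the list of values stored for key 1.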


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Tue, Jun 24, 2014 at 3:33 AM, Aaron <aaron.doss...@target.com> wrote:

> Sorry, I got my sample outputs wrong
>
> (1,1) -> 400
> (1,2) -> 500
> (2,2) -> 600
>
> On Jun 23, 2014, at 4:29 PM, "Aaron Dossett [via Apache Spark User List]"
> <[hidden email]> wrote:
>
> I am relatively new to Spark and am getting stuck trying to do the
> following:
>
> - My input is integer key, value pairs where the key is not unique.  I'm
> interested in information about all possible distinct key combinations,
> thus the Cartesian product.
> - My first attempt was to create a separate RDD of this Cartesian product
> and then use map() to calculate the data.  However, I was trying to pass
> another RDD to the function that map() was calling, which I eventually
> figured out was causing a runtime error, even if the function I called
> with map() did nothing.  Here's a simple code example:
>
> -------
> def somefunc(x, y, RDD):
>   return 0
>
> input = sc.parallelize([(1,100), (1,200), (2, 100), (2,300)])
>
> #Create all pairs of keys, including self-pairs
> itemPairs = input.map(lambda x: x[0]).distinct()
> itemPairs = itemPairs.cartesian(itemPairs)
>
> print itemPairs.collect()
>
> TC = itemPairs.map(lambda x: (x, somefunc(x[0], x[1], input)))
>
> print TC.collect()
> ------
>
> I'm assuming this isn't working because it isn't a very Spark-like way to
> do things and I could imagine that passing RDDs into other RDD's map
> functions might not make sense.  Could someone suggest to me a way to apply
> transformations and actions to "input" that would produce a mapping of key
> pairs to some information related to the values.
>
> For example, I might want (1, 2) to map to the sum of the maximum
> values found for each key in the input (500 in my sample data above).
> Extending that example, (1,1) would map to 300 and (2,2) to 400.
>
> Please let me know if I should provide more details or a more robust
> example.
>
> Thank you, Aaron
>
> ------------------------------
> View this message in context: Re: Efficiently doing an analysis with
> Cartesian product (pyspark)
> <http://apache-spark-user-list.1001560.n3.nabble.com/Efficiently-doing-an-analysis-with-Cartesian-product-pyspark-tp8144p8145.html>
>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
