Re: How to work with a joined rdd in pyspark?

2015-11-30 Thread arnalone
Ahhh, I get it, thanks!! I did not know we could use a "double index": x[0] points to the show, x[1][0] to the channel, and x[1][1] to the views. I feel terribly noob. Thank you all :)
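
Putting the thread together, a minimal PySpark sketch of the whole job (the RDD names and the "ABC" filter come from this thread; the sample data, the show name "Evening_News", and the app name are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="abc-show-views")  # app name is hypothetical

    # (show, channel) and (show, views) pairs; sample data is made up
    show_channel = sc.parallelize([("PostModern_Cooking", "DEF"),
                                   ("Evening_News", "ABC")])
    show_views = sc.parallelize([("PostModern_Cooking", 1038),
                                 ("Evening_News", 415),
                                 ("Evening_News", 52)])

    # join yields records of the form (show, (channel, views))
    joined_dataset = show_channel.join(show_views)

    abc_views = (joined_dataset
                 .filter(lambda x: x[1][0] == "ABC")   # keep ABC rows only
                 .map(lambda x: (x[0], x[1][1]))       # (show, views)
                 .reduceByKey(lambda a, b: a + b))     # total views per show

    print(abc_views.collect())  # [('Evening_News', 467)]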

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread Gylfi
Hi. Your code is like this, right?

    joined_dataset = show_channel.join(show_views)
    joined_dataset.take(4)

Note that .take(4) returns a plain Python array (a list), not an RDD, so its result does not support any RDD operations. Could that be the problem? Otherwise, more code is needed to tell what is going wrong.
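
A minimal illustration of that distinction, using the variable names from this thread:

    joined_dataset = show_channel.join(show_views)  # an RDD: filter/map/reduceByKey all work
    first_four = joined_dataset.take(4)             # a plain Python list, not an RDD
    # first_four supports list operations only; RDD methods like .filter() fail on it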

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread arnalone
Thanks for replying so fast! Sorry it was not clear; my code is:

    joined_dataset = show_channel.join(show_views)

For your information, the first lines of the result are:

    joined_dataset.take(4)
    Out[93]: [(u'PostModern_Cooking', (u'DEF', 1038)), (u'PostModern_Cooking', (u'DEF', 415)), (u'PostModern_Cooking', (u'DEF',
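
Each joined record is a nested pair of the form (show, (channel, views)), so one record (the first from the output above) can be unpacked as:

    show, (channel, views) = (u'PostModern_Cooking', (u'DEF', 1038))
    # show == u'PostModern_Cooking', channel == u'DEF', views == 1038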

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread Gylfi
Hi. Can't you do a filter to get only the ABC shows, map that into a keyed instance of the show, and then do a reduceByKey to sum up the views? Something like this in Scala code:

    // filter for the channel, build a new pair (show, view count), then sum
    val myAnswer = joined_dataset
      .filter(_._2._1 == "ABC")
      .map(x => (x._1, x._2._2))
      .reduceByKey(_ + _)

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread arnalone
Yes, that's what I am trying to do, but I can't manage to "point" at the channel field to filter on "ABC", and then, in the map step, keep only the shows and views. In Scala you do it with (_._2._1 == "ABC") and (_._1, _._2._2), but I can't find the right syntax in Python to do the same :(
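
For the record, the PySpark equivalents of those Scala accessors look like this (a sketch; the lambda parameter name x and the RDD names are as used elsewhere in the thread):

    # Scala _._2._1 == "ABC"  ->  Python lambda x: x[1][0] == "ABC"
    # Scala (_._1, _._2._2)   ->  Python lambda x: (x[0], x[1][1])
    abc_only = joined_dataset.filter(lambda x: x[1][0] == "ABC")
    show_and_views = abc_only.map(lambda x: (x[0], x[1][1]))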

Re: How to work with a joined rdd in pyspark?

2015-11-29 Thread Gylfi
Can't you just access it by element, with [0] and [1]? http://www.tutorialspoint.com/python/python_tuples.htm
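
A minimal illustration of element access on one joined record (the value is hypothetical):

    pair = (u'PostModern_Cooking', (u'DEF', 1038))
    pair[0]     # u'PostModern_Cooking' -- the show
    pair[1]     # (u'DEF', 1038)        -- the (channel, views) pair
    pair[1][0]  # u'DEF'                -- the channel
    pair[1][1]  # 1038                  -- the views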