**Learning the ropes**

I'm trying to grasp the concept of using pipelines in PySpark...

Simplified example:
>>> data = [(1, "alpha"), (1, "beta"), (1, "foo"), (1, "alpha"),
...         (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")]

Desired outcome:
[(1,3),(2,2),(3,1)]

Basically for each key, I want the number of unique values.
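
In plain Python terms (setting Spark aside for a moment), what I'm after is something like the following, assuming the data fits in memory on a single machine:

>>> from collections import defaultdict
>>> uniques = defaultdict(set)        # key -> set of distinct values
>>> for key, value in data:
...     uniques[key].add(value)
...
>>> sorted((key, len(values)) for key, values in uniques.items())
[(1, 3), (2, 2), (3, 1)]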

I've tried different approaches, but I'm not sure I'm really using Spark effectively.
I wondered if I should do something like:
>>> rdd = sc.parallelize(data)
>>> rdd.groupByKey().collect()
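
Since groupByKey() gives back each key with an iterable of its values, I suppose one way to finish from there (just my guess, not sure it's the idiomatic approach) would be to wrap each group in a set and take its length:

>>> # count the distinct values in each group; collect() order may vary
>>> rdd.groupByKey().mapValues(lambda vals: len(set(vals))).collect()
[(1, 3), (2, 2), (3, 1)]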

Then I wondered if I could do something like a foreach over each key's values,
and then map the actual values and reduce them.  Pseudo-code:

rdd.groupByKey()
   .keys
   .foreach(_.values
            .map(lambda x: (x, 1))
            .reduceByKey(lambda a, b: a + b)
            .count()
   )
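
For comparison, here is my rough attempt at flattening that idea into a single chain that stays inside one pipeline, deduplicating the (key, value) pairs with distinct() and then counting what is left per key. Again, only a sketch; I'd expect something like [(1, 3), (2, 2), (3, 1)] out of collect(), but I don't know if this is the right way to go about it:

>>> from operator import add
>>> # distinct() drops repeated (key, value) pairs; each surviving pair
>>> # then becomes (key, 1) and the ones are summed per key
>>> sc.parallelize(data).distinct() \
...     .map(lambda kv: (kv[0], 1)) \
...     .reduceByKey(add) \
...     .collect()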

I was somehow hoping that each key would end up paired with that count, and
thus give me the number of unique values per key, which is exactly what I
think I'm looking for.

Am I way off base on how I could accomplish this?

Marco
