**Learning the ropes**

I'm trying to grasp the concept of using the pipeline in PySpark...

Simplified example:

Desired outcome:

Basically for each key, I want the number of unique values.

I've tried different approaches, but I'm not sure I'm really using Spark
effectively. I wondered whether I should do something like:
>>> input=sc.parallelize(list)
>>> input.groupByKey().collect()
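
For reference, a minimal sketch of how the groupByKey() idea could be carried
through to a per-key count, assuming the input is an RDD of (key, value)
pairs. The function name unique_value_counts_grouped is made up for
illustration; groupByKey() and mapValues() are standard RDD methods.

```python
def unique_value_counts_grouped(pairs_rdd):
    """Return an RDD of (key, number_of_distinct_values) via groupByKey.

    Assumes pairs_rdd holds (key, value) tuples with hashable values.
    """
    return (
        pairs_rdd
        .groupByKey()                                # (key, iterable of all values)
        .mapValues(lambda values: len(set(values)))  # count distinct values per key
    )
```

Note that groupByKey() brings every value for a key together before anything
is counted, so this is probably only comfortable when no single key has a
huge number of values.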

Then I wondered if I could do something like a foreach over each key value,
and then map the actual values and reduce them.  Pseudo-code:

.map(lambda x: (x, 1))
.reduceByKey(lambda a, b: a + b)

I was somehow hoping that each key would end up with the current value of the
count, and thus the count of unique values for that key, which is exactly
what I think I'm looking for.
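
A minimal sketch of that map-and-reduce idea, assuming the input is an RDD of
(key, value) pairs: drop duplicate pairs first with distinct(), then emit a 1
per remaining pair and sum per key. The names unique_value_counts and demo,
and the sample data inside demo, are hypothetical and only there for
illustration.

```python
from operator import add


def unique_value_counts(pairs_rdd):
    """Return an RDD of (key, number_of_distinct_values).

    Assumes pairs_rdd holds (key, value) tuples that are hashable.
    """
    return (
        pairs_rdd
        .distinct()                  # keep each (key, value) pair once
        .map(lambda kv: (kv[0], 1))  # one marker per distinct value
        .reduceByKey(add)            # sum the markers for each key
    )


def demo():
    # Hypothetical sample data, not the data from the original example.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize([("a", 1), ("a", 1), ("a", 2), ("b", 3)])
    return sorted(unique_value_counts(pairs).collect())  # [('a', 2), ('b', 1)]
```

Without the distinct() step, the same map and reduceByKey would count every
occurrence of a key rather than its unique values.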

Am I way off base on how I could accomplish this?

