How about this?

// distinct() removes duplicate (key, value) pairs; combineByKey then counts
// the remaining (unique) values per key
input.distinct().combineByKey((v: String) => 1, (agg: Int, v: String) => agg + 1,
  (agg1: Int, agg2: Int) => agg1 + agg2).collect()
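
To make that self-contained with the sample data from the original post (a
minimal sketch, assuming a Scala shell with an active SparkContext sc; the
result ordering may vary across partitions):

val input = sc.parallelize(Seq((1, "alpha"), (1, "beta"), (1, "foo"),
  (1, "alpha"), (2, "alpha"), (2, "alpha"), (2, "bar"), (3, "foo")))
val counts = input.distinct().combineByKey((v: String) => 1,
  (agg: Int, v: String) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2)
counts.collect()  // expected: Array((1,3), (2,2), (3,1)) in some order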

On Mon, Apr 13, 2015 at 10:31 AM, Dean Wampler <deanwamp...@gmail.com>
wrote:

> The problem with using collect is that it will fail for large data sets,
> as you'll attempt to copy the entire RDD to the memory of your driver
> program. The following works (Scala syntax, but similar to Python):
>
> scala> val i1 = input.distinct.groupByKey
> scala> i1.foreach(println)
> (1,CompactBuffer(beta, alpha, foo))
> (3,CompactBuffer(foo))
> (2,CompactBuffer(alpha, bar))
>
> scala> val i2 = i1.map(tup => (tup._1, tup._2.size))
> scala> i2.foreach(println)
> (1,3)
> (3,1)
> (2,2)
>
> The "i2" line passes a function that takes a tuple argument, then
> constructs a new output tuple from the first element (the key) and the
> size of the second (the CompactBuffer of values). An alternative
> pattern-match syntax would be:
>
> scala> val i2 = i1.map { case (key, buffer) => (key, buffer.size) }
>
> This should work as long as none of the CompactBuffers is too large, since
> groupByKey materializes all of a key's values in memory; that could be a
> problem for extremely large data sets.
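>
> If the buffers are a concern, a variation along the same lines that never
> materializes them is to count as you go (a sketch, not run here):
>
> scala> val i3 = input.distinct.mapValues(_ => 1).reduceByKey(_ + _)
> scala> i3.foreach(println)
>
> reduceByKey combines the 1s map-side, so only running counts per key are
> shuffled, not the values themselves.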
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Mon, Apr 13, 2015 at 11:45 AM, Marco Shaw <marco.s...@gmail.com> wrote:
>
>> **Learning the ropes**
>>
>> I'm trying to grasp the concept of using the pipeline in PySpark...
>>
>> Simplified example:
>> >>> list = [(1,"alpha"), (1,"beta"), (1,"foo"), (1,"alpha"),
>> ...         (2,"alpha"), (2,"alpha"), (2,"bar"), (3,"foo")]
>>
>> Desired outcome:
>> [(1,3),(2,2),(3,1)]
>>
>> Basically for each key, I want the number of unique values.
>>
>> I've tried different approaches, but am I really using Spark
>> effectively?  I wondered if I would do something like:
>> >>> input=sc.parallelize(list)
>> >>> input.groupByKey().collect()
>>
>> Then I wondered if I could do something like a foreach over each key
>> value, and then map the actual values and reduce them.  Pseudo-code:
>>
>> input.groupByKey()
>>      .keys
>>      .foreach(_.values
>>          .map(lambda x: (x, 1))
>>          .reduceByKey(lambda a, b: a + b)
>>          .count()
>>      )
>>
>> I was somehow hoping that each key would end up paired with that count
>> (the number of unique values per key), which is exactly what I think I'm
>> looking for.
>>
>> Am I way off base on how I could accomplish this?
>>
>> Marco
>>
>
>
