Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !
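For example, here is a rough sketch of the Spark SQL route (names and details below are illustrative, not tested against your data; it assumes a Spark shell with `sc` in scope, Spark SQL on the classpath, and the timestamp,page,userId layout from your mail):

```scala
// Sketch only: assumes a Spark shell (sc in scope) and data.csv in the
// timestamp,page,userId format from the question.
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.SQLContext

case class View(day: String, page: String, userId: String)

val sqlContext = new SQLContext(sc)
import sqlContext._  // implicit conversion from RDD[View] to a table-able RDD

val views = sc.textFile("data.csv").map(_.split(",")).map { f =>
  // Epoch seconds -> calendar day, so the query can group by date
  val day = new SimpleDateFormat("yyyy-MM-dd").format(new Date(f(0).toLong * 1000L))
  View(day, f(1), f(2))
}
// Called registerAsTable on Spark 1.0.x; renamed registerTempTable later
views.registerAsTable("views")

// Unique users per page per day
sqlContext.sql(
  "SELECT page, day, COUNT(DISTINCT userId) AS uniques " +
  "FROM views GROUP BY page, day"
).collect().foreach(println)
```

COUNT(DISTINCT userId) handles the unique-user part, and grouping by a day column derived from the timestamp handles the group-by-date part.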

On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath
<nick.pentre...@gmail.com> wrote:
> You can use .distinct.count on your user RDD.
>
> What are you trying to achieve with the time group by?
> —
> Sent from Mailbox
>
>
> On Tue, Jul 15, 2014 at 8:14 PM, buntu <buntu...@gmail.com> wrote:
>>
>> Hi --
>>
>> New to Spark and trying to figure out how to generate unique counts per
>> page by date, given this raw data:
>>
>> timestamp,page,userId
>> 1405377264,google,user1
>> 1405378589,google,user2
>> 1405380012,yahoo,user1
>> ..
>>
>> I can do a groupBy on a field and get a count:
>>
>> val lines=sc.textFile("data.csv")
>> val csv=lines.map(_.split(","))
>> // group by page
>> csv.groupBy(_(1)).count
>>
>> But I'm not able to see how to do a count distinct on userId, or how to
>> apply another groupBy on the timestamp field. Please let me know how to
>> handle such cases.
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
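Without Spark SQL, the same result also falls out of the plain RDD API along the lines Nick suggests: key each record by (page, day), distinct the (key, userId) pairs, then count per key. A rough, untested sketch under the same assumptions (sc in scope, same data.csv layout):

```scala
// Sketch only: assumes a SparkContext named sc and the
// timestamp,page,userId layout from the question.
import java.text.SimpleDateFormat
import java.util.Date

def dayOf(epochSeconds: Long): String =
  new SimpleDateFormat("yyyy-MM-dd").format(new Date(epochSeconds * 1000L))

val uniques = sc.textFile("data.csv")
  .map(_.split(","))
  .map(f => ((f(1), dayOf(f(0).toLong)), f(2)))  // ((page, day), userId)
  .distinct()                                    // drop repeat visits by the same user
  .mapValues(_ => 1L)
  .reduceByKey(_ + _)                            // unique users per (page, day)

uniques.collect().foreach(println)
```

One caveat on the snippet in the question: `csv.groupBy(_(1)).count` counts the number of groups (i.e. distinct pages), not records per page. `reduceByKey` as above gives per-key counts and also avoids shuffling whole groups the way `groupBy` does.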
