Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html !
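
As a rough sketch of what that could look like for the sample data quoted below: this assumes a recent Spark release where SparkSession is available (the 1.0-era entry point was SQLContext), and the object name UniqueUsersPerPage, the helper countUniqueUsers, the view name events, and the column names ts, day and unique_users are all made up for illustration. The timestamp column is renamed to ts only to avoid any clash with the SQL type name, and the session time zone is pinned to UTC so that "date" means the UTC calendar day.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UniqueUsersPerPage {

  // Hypothetical helper: distinct users per (day, page) for the sample rows.
  def countUniqueUsers(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // In-memory copy of the three sample rows from the question.
    // For the real file, spark.read.csv("data.csv") would be the starting point.
    val events = Seq(
      (1405377264L, "google", "user1"),
      (1405378589L, "google", "user2"),
      (1405380012L, "yahoo", "user1")
    ).toDF("ts", "page", "userId")

    events.createOrReplaceTempView("events")

    // Convert epoch seconds to a calendar day, then count distinct users
    // within each (day, page) group.
    spark.sql(
      """SELECT to_date(from_unixtime(ts)) AS day,
        |       page,
        |       COUNT(DISTINCT userId) AS unique_users
        |FROM events
        |GROUP BY to_date(from_unixtime(ts)), page
        |ORDER BY day, page""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unique-users-per-page")
      .master("local[*]")
      .config("spark.sql.session.timeZone", "UTC") // interpret timestamps as UTC
      .getOrCreate()

    countUniqueUsers(spark).show()
    spark.stop()
  }
}
```

All three sample timestamps fall on 2014-07-14 in UTC, so this should report two distinct users for google and one for yahoo on that day.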

On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> You can use .distinct.count on your user RDD.
>
> What are you trying to achieve with the time group by?
>
> On Tue, Jul 15, 2014 at 8:14 PM, buntu <buntu...@gmail.com> wrote:
>>
>> Hi --
>>
>> New to Spark and trying to figure out how to generate unique counts per
>> page by date given this raw data:
>>
>> timestamp,page,userId
>> 1405377264,google,user1
>> 1405378589,google,user2
>> 1405380012,yahoo,user1
>> ..
>>
>> I can do a groupBy on a field and get the count:
>>
>> val lines=sc.textFile("data.csv")
>> val csv=lines.map(_.split(","))
>> // group by page
>> csv.groupBy(_(1)).count
>>
>> But I am not able to see how to do a count distinct on userId and also
>> apply another groupBy on the timestamp field. Please let me know how to
>> handle such cases.
>>
>> Thanks!
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.