Re: Count distinct with groupBy usage
You can use .distinct.count on your user RDD. What are you trying to achieve with the time groupBy?

— Sent from Mailbox

On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote:

> Hi -- I'm new to Spark and trying to figure out how to generate unique counts per page by date, given this raw data:
>
>     timestamp,page,userId
>     1405377264,google,user1
>     1405378589,google,user2
>     1405380012,yahoo,user1
>     ...
>
> I can group by a field and get the count:
>
>     val lines = sc.textFile("data.csv")
>     val csv = lines.map(_.split(","))
>     // group by page
>     csv.groupBy(_(1)).count
>
> But I can't see how to count distinct userIds, or how to apply another groupBy on the timestamp field. Please let me know how to handle such cases. Thanks!
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
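[Editor's sketch of the distinction being discussed, using plain Scala collections in place of the RDD — the same groupBy/distinct logic applies, but nothing here is Spark-specific. The sample rows come from the question above, plus one hypothetical extra row repeating user1 on google so that row counts and distinct-user counts actually diverge.]

```scala
// Plain Scala Seqs standing in for the RDD.
// Field positions match the CSV: 0 = timestamp, 1 = page, 2 = userId.
val lines = Seq(
  "1405377264,google,user1",
  "1405378589,google,user2",
  "1405380012,yahoo,user1",
  "1405380100,google,user1"  // user1 hits google again (hypothetical extra row)
)
val csv = lines.map(_.split(","))

// Rows per page: counts every hit, including repeat visits by the same user.
val rowsPerPage = csv.groupBy(_(1)).map { case (page, rows) => page -> rows.size }

// Distinct users per page: map each grouped row down to its userId first,
// then deduplicate before counting.
val usersPerPage = csv.groupBy(_(1)).map { case (page, rows) =>
  page -> rows.map(_(2)).distinct.size
}
```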
Re: Count distinct with groupBy usage
Sounds like a job for Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html !

On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath nick.pentre...@gmail.com wrote:

> You can use .distinct.count on your user RDD. What are you trying to achieve with the time groupBy?
>
> On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote: [...]
Re: Count distinct with groupBy usage
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only arrive in CDH 5.1, which is yet to be released. If Spark SQL is the only option, I might need to hack around to add it to the current CDH deployment, if that's possible.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9787.html
Re: Count distinct with groupBy usage
Thanks Nick. All I'm attempting is to report the number of unique visitors per page by date.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9786.html
Re: Count distinct with groupBy usage
If you are counting per time and per page, then you need to group by time and page, not just page. Something more like:

    csv.groupBy(c => (c(0), c(1))) ...

This gives the rows per (time, page) key. As Nick suggests, you then count the distinct userIds for each key:

    ... .mapValues(_.map(_(2)).distinct.size)

If you can tolerate some approximation, then countApproxDistinctByKey will be a lot faster. It works on (key, value) pairs rather than grouped rows:

    csv.map(c => ((c(0), c(1)), c(2))).countApproxDistinctByKey()

On Tue, Jul 15, 2014 at 7:14 PM, buntu buntu...@gmail.com wrote: [...]
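[Editor's sketch making the suggestion above concrete, again with plain Scala collections standing in for the RDD. One addition beyond the thread: since the question asks for counts per page by *date*, the raw epoch timestamp is converted to a calendar date before grouping — keying on c(0) directly would group per second, not per day. The UTC zone choice is an assumption.]

```scala
import java.time.{Instant, LocalDate, ZoneOffset}

// Sample rows from the question: 0 = epoch timestamp, 1 = page, 2 = userId.
val lines = Seq(
  "1405377264,google,user1",
  "1405378589,google,user2",
  "1405380012,yahoo,user1"
)
val csv = lines.map(_.split(","))

// Epoch seconds -> calendar date (UTC assumed), so the grouping key is
// (date, page) rather than (second, page).
def toDate(epochSeconds: String): LocalDate =
  Instant.ofEpochSecond(epochSeconds.toLong).atZone(ZoneOffset.UTC).toLocalDate

// Group by (date, page), then count distinct userIds within each group.
val uniqueVisitors = csv
  .groupBy(c => (toDate(c(0)), c(1)))
  .map { case (key, rows) => key -> rows.map(_(2)).distinct.size }
```

The exact aggregation — unique visitors per page per day — falls out of the map: all three sample rows land on 2014-07-14, giving two distinct users for google and one for yahoo.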
Re: Count distinct with groupBy usage
Thanks Sean!! That's what I was looking for -- a group by on multiple fields. I'm gonna play with it now. Thanks again!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9803.html