Re: Count distinct with groupBy usage

2014-07-15 Thread Nick Pentreath
You can use .distinct.count on your user RDD.


What are you trying to achieve with the time group by?
—
Sent from Mailbox
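A minimal sketch of that suggestion on plain Scala collections standing in for the RDD (the `map`/`distinct` shape is the same; the sample rows below are made up):

```scala
// Hypothetical rows in the thread's timestamp,page,userId format
val rows = Seq(
  Array("1405377264", "google", "user1"),
  Array("1405378589", "google", "user2"),
  Array("1405380012", "yahoo", "user1")
)

// Project out the userId column, then count distinct values
val uniqueUsers = rows.map(_(2)).distinct.size  // 2
```

On the RDD the equivalent would be `csv.map(_(2)).distinct.count`.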

On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote:

 Hi --
 New to Spark and trying to figure out how to generate unique counts per
 page by date given this raw data:
 timestamp,page,userId
 1405377264,google,user1
 1405378589,google,user2
 1405380012,yahoo,user1
 ..
 I can do a groupBy on a field and get the count:
 val lines = sc.textFile("data.csv")
 val csv = lines.map(_.split(","))
 // group by page
 csv.groupBy(_(1)).count
 But not able to see how to do count distinct on userId and also apply
 another groupBy on timestamp field. Please let me know how to handle such
 cases. 
 Thanks!
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
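One note on the quoted snippet: with Scala collections (and likewise with RDDs), `groupBy(_(1))` followed by a bare count gives the number of groups, not a count per page; per-page counts need a `mapValues` step. A sketch on plain Scala collections with made-up rows:

```scala
val rows = Seq(
  Array("1405377264", "google", "user1"),
  Array("1405378589", "google", "user2"),
  Array("1405380012", "yahoo", "user1")
)

val byPage = rows.groupBy(_(1))             // page -> rows for that page
val numPages = byPage.size                  // 2: the number of groups
val hitsPerPage = byPage.mapValues(_.size)  // google -> 2, yahoo -> 1
```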

Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !


Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only be
available in CDH 5.1, which is yet to be released.

If Spark SQL is the only option, then I might need to hack around to add it
to the current CDH deployment, if that's possible.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9787.html


Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks Nick.

All I'm attempting is to report the number of unique visitors per page by date.
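Since the raw rows carry epoch-seconds timestamps, "by date" needs a day-bucketing step first. One way to sketch it on plain Scala collections (the `yyyy-MM-dd` format and UTC timezone are assumptions; the rows are the thread's sample data):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))  // pin the zone so day buckets are stable

def toDay(epochSeconds: String): String =
  fmt.format(new Date(epochSeconds.toLong * 1000L))

val rows = Seq(
  Array("1405377264", "google", "user1"),
  Array("1405378589", "google", "user2"),
  Array("1405380012", "yahoo", "user1")
)

// Key each row by (day, page), then count distinct userIds per key
val uniquesPerDayPage = rows
  .groupBy(r => (toDay(r(0)), r(1)))
  .mapValues(_.map(_(2)).distinct.size)
```

On the RDD side the same keying and counting applies; only the `groupBy`/`mapValues` calls move from the collections API to the RDD API.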



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9786.html


Re: Count distinct with groupBy usage

2014-07-15 Thread Sean Owen
If you are counting per time and per page, then you need to group by
time and page, not just page. Something more like:

csv.groupBy(csv => (csv(0), csv(1))) ...

This gives the rows for each (time, page) key. As Nick suggests, you then
count the distinct users for each key:

... .mapValues(_.map(_(2)).distinct.size)

If you can tolerate some approximation, then using
countApproxDistinctByKey will be a lot faster. It works on a pair RDD, so
key each row by (time, page) first:

csv.map(line => ((line(0), line(1)), line(2))).countApproxDistinctByKey()
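Putting the two steps together end-to-end on plain Scala collections as a stand-in for the RDD (same method names; the rows here are made up, with two distinct users sharing one (time, page) key):

```scala
val rows = Seq(
  Array("100", "google", "user1"),
  Array("100", "google", "user1"),  // repeat visit by the same user
  Array("100", "google", "user2"),
  Array("200", "yahoo",  "user1")
)

val uniques = rows
  .groupBy(r => (r(0), r(1)))               // step 1: group by (time, page)
  .mapValues(_.map(_(2)).distinct.size)     // step 2: distinct userIds per key

// ("100","google") -> 2, ("200","yahoo") -> 1
```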



Re: Count distinct with groupBy usage

2014-07-15 Thread buntu
Thanks, Sean!! That's what I was looking for -- group by on multiple fields.

I'm gonna play with it now. Thanks again!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781p9803.html