Re: Segmented fold count

2014-08-20 Thread fil
of data that relates to a single key - or is this only ever by coincidence?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Segmented-fold-count-tp12278p12478.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Segmented fold count

2014-08-18 Thread Davies Liu
, you can use RDD.mapPartitions(groupCount).collect()
On Sun, Aug 17, 2014 at 10:34 PM, fil f...@pobox.com wrote:
> Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count.
> Input: [1,1,1,2,2,3,4,4,5,1]
> Output: [(1,3
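The reply above passes a `groupCount` function to `RDD.mapPartitions`, but the snippet does not show its body. The following is only a sketch of what such a function might look like, written in plain Python so it runs without a cluster. The names `group_count` and `merge_boundaries` and the simulated `partitions` list are assumptions made for illustration, not code from the thread. Because a run of equal values can straddle a partition boundary, the per-partition results are merged afterwards.

```python
from itertools import groupby

def group_count(iterator):
    """Yield (value, run_length) for each run of equal values in one partition."""
    for key, run in groupby(iterator):
        yield (key, sum(1 for _ in run))

def merge_boundaries(pairs):
    """Merge adjacent pairs that share a value (a run split across partitions)."""
    merged = []
    for key, n in pairs:
        if merged and merged[-1][0] == key:
            merged[-1] = (key, merged[-1][1] + n)
        else:
            merged.append((key, n))
    return merged

# Hypothetical partitioning of the input from the question; the first run
# of 1s is deliberately split across two partitions.
partitions = [[1, 1], [1, 2, 2, 3], [4, 4, 5, 1]]

# Stand-in for rdd.mapPartitions(group_count).collect(): apply the function
# to each partition in order and concatenate the results.
per_partition = [pair for part in partitions for pair in group_count(iter(part))]
result = merge_boundaries(per_partition)
print(result)
```

With PySpark, `group_count` could presumably be passed as `rdd.mapPartitions(group_count).collect()`, with `merge_boundaries` applied to the collected list on the driver. This assumes the RDD preserves the original element order.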

Re: Segmented fold count

2014-08-18 Thread Andrew Ash
()
On Sun, Aug 17, 2014 at 10:34 PM, fil f...@pobox.com wrote:
> Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count.
> Input: [1,1,1,2,2,3,4,4,5,1]
> Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1

Re: Segmented fold count

2014-08-18 Thread Davies Liu
, but whatever..)? I'm looking for a kind of segmented fold count.
Input: [1,1,1,2,2,3,4,4,5,1]
Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1)]
or preferably two output columns:
id: [1,2,3,4,5,1]
count: [3,2,1,2,1,1]
I can use a groupby/count, except for the fact that I just want to scan

Re: Segmented fold count

2014-08-18 Thread fil
.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Segmented-fold-count-tp12278p12295.html

Re: Segmented fold count

2014-08-18 Thread fil
to clarify anyone - when are partitions used to describe chunks of data for different nodes in the cluster (i.e. large), and when are they used to describe groups of items in data (i.e. small)?

Re: Segmented fold count

2014-08-18 Thread Davies Liu

Segmented fold count

2014-08-17 Thread fil
Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count.
Input: [1,1,1,2,2,3,4,4,5,1]
Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1)]
or preferably two output columns:
id: [1,2,3,4,5,1]
count: [3,2,1,2,1,1
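For reference, the requested scan is a run-length count over consecutive equal values. On a single machine it can be sketched with the standard library's `itertools.groupby`, which groups adjacent equal items only, so the trailing `1` stays a separate run. The function name `segmented_count` is made up for this illustration. This is a local sketch, not a distributed Spark solution.

```python
from itertools import groupby

def segmented_count(values):
    """Return (value, run_length) for each run of consecutive equal values."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(values)]

data = [1, 1, 1, 2, 2, 3, 4, 4, 5, 1]
pairs = segmented_count(data)

# Split the pairs into the two "columns" asked for in the question.
ids = [key for key, _ in pairs]
counts = [n for _, n in pairs]

print(pairs)   # [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
print(ids)     # [1, 2, 3, 4, 5, 1]
print(counts)  # [3, 2, 1, 2, 1, 1]
```

Unlike a plain groupby/count by key, this never merges the first and last runs of `1`, which is the behaviour the question asks for.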