of data that relates to a single key -
or is this only ever by coincidence?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Segmented-fold-count-tp12278p12478.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
You can use RDD.mapPartitions(groupCount).collect()
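The groupCount helper is not shown anywhere in the thread, so here is a minimal sketch of what such a function might look like - a run-length count over one partition's iterator using itertools.groupby (the name groupCount and its exact behavior are assumptions):

```python
from itertools import groupby

def groupCount(iterator):
    # Run-length encode consecutive equal values within one partition:
    # [1, 1, 1, 2, 2] -> [(1, 3), (2, 2)]
    for key, run in groupby(iterator):
        yield (key, sum(1 for _ in run))

# Plain-Python stand-in for rdd.mapPartitions(groupCount).collect(),
# treating the whole list as a single partition.
print(list(groupCount([1, 1, 1, 2, 2, 3, 4, 4, 5, 1])))
# -> [(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
```

Note that with more than one partition, a run that straddles a partition boundary is split into two counts, so a follow-up pass to merge adjacent pairs with equal keys across boundaries would still be needed.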
On Sun, Aug 17, 2014 at 10:34 PM, fil f...@pobox.com wrote:
Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count.
Input: [1,1,1,2,2,3,4,4,5,1]
Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1)]
or preferably two output columns:
id: [1,2,3,4,5,1]
count: [3,2,1,2,1,1]
I could use a groupby/count, except that I just want a single scan.
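For reference, the behavior being asked for - counting consecutive duplicates in one left-to-right scan, rather than a global group-by that merges all occurrences of a key - can be sketched in plain Python like this (function name is my own, for illustration):

```python
def segmented_counts(seq):
    # Single left-to-right scan: start a new (id, count) pair whenever
    # the value changes. Unlike a global group-by, the same key can
    # appear more than once in the output if its runs are separated.
    pairs = []
    for x in seq:
        if pairs and pairs[-1][0] == x:
            pairs[-1] = (x, pairs[-1][1] + 1)
        else:
            pairs.append((x, 1))
    return pairs

pairs = segmented_counts([1, 1, 1, 2, 2, 3, 4, 4, 5, 1])
ids = [k for k, _ in pairs]      # [1, 2, 3, 4, 5, 1]
counts = [n for _, n in pairs]   # [3, 2, 1, 2, 1, 1]
```

Note the trailing 1 produces a second (1, 1) entry, which is exactly what distinguishes this from a groupby/count.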
To clarify - when are "partitions" used to describe chunks of data held on different nodes in the cluster (i.e. large), and when are they used to describe groups of items within the data (i.e. small)?
Can anyone assist with a scan of the following kind (Python preferred, but whatever..)? I'm looking for a kind of segmented fold count.
Input: [1,1,1,2,2,3,4,4,5,1]
Output: [(1,3), (2, 2), (3, 1), (4, 2), (5, 1), (1,1)]
or preferably two output columns:
id: [1,2,3,4,5,1]
count: [3,2,1,2,1,1]