Thanks, Reynold. I still need to handle incomplete groups that fall between partition boundaries. So, I need a two-pass approach. I came up with a somewhat hacky way to handle those using the partition indices and key-value pairs as a second pass after the first.
OCaml's std library provides a function called group() that takes a break function that operators on pairs of successive elements. It seems a similar approach could be used in Spark and would be more efficient than my approach with key-value pairs since you know the ordering of the partitions. Has this need been expressed by others? On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <r...@databricks.com> wrote: > Try mapPartitions, which gives you an iterator, and you can produce an > iterator back. > > > On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rnowl...@gmail.com> wrote: > >> Hi all, >> >> I have a problem where I have a RDD of elements: >> >> Item1 Item2 Item3 Item4 Item5 Item6 ... >> >> and I want to run a function over them to decide which runs of elements >> to group together: >> >> [Item1 Item2] [Item3] [Item4 Item5 Item6] ... >> >> Technically, I could use aggregate to do this, but I would have to use a >> List of List of T which would produce a very large collection in memory. >> >> Is there an easy way to accomplish this? e.g.,, it would be nice to have >> a version of aggregate where the combination function can return a complete >> group that is added to the new RDD and an incomplete group which is passed >> to the next call of the reduce function. >> >> Thanks, >> RJ >> > >