Could you use a custom partitioner to preserve boundaries, so that all related tuples end up on the same partition?
On Jun 30, 2015, at 12:00 PM, RJ Nowling <rnowl...@gmail.com> wrote:

> Thanks, Reynold. I still need to handle incomplete groups that fall between
> partition boundaries, so I need a two-pass approach. I came up with a
> somewhat hacky way to handle those using the partition indices and key-value
> pairs as a second pass after the first.
>
> OCaml's standard library provides a function called group() that takes a break
> function that operates on pairs of successive elements. It seems a similar
> approach could be used in Spark, and it would be more efficient than my approach
> with key-value pairs since you know the ordering of the partitions.
>
> Has this need been expressed by others?
>
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <r...@databricks.com> wrote:
>
> Try mapPartitions, which gives you an iterator, and you can produce an
> iterator back.
>
> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>
> Hi all,
>
> I have a problem where I have an RDD of elements:
>
> Item1 Item2 Item3 Item4 Item5 Item6 ...
>
> and I want to run a function over them to decide which runs of elements to
> group together:
>
> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>
> Technically, I could use aggregate to do this, but I would have to use a
> List of List of T, which would produce a very large collection in memory.
>
> Is there an easy way to accomplish this? E.g., it would be nice to have a
> version of aggregate where the combination function can return a complete
> group that is added to the new RDD and an incomplete group which is passed
> to the next call of the reduce function.
>
> Thanks,
> RJ
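The break-function idea from the thread can be sketched without a cluster. Below is a minimal plain-Python version of the per-partition grouping logic (the names `group_runs` and `should_break` are hypothetical, not from the thread); the same generator could be applied to each partition's iterator via `mapPartitions`, and runs straddling partition boundaries would still need the second merging pass RJ describes.

```python
def group_runs(items, should_break):
    """Group an iterable into runs, starting a new run whenever
    should_break(prev, cur) returns True for a pair of successive
    elements. Yields one list per run, so only the current run is
    ever held in memory (unlike a List of List of T built eagerly)."""
    it = iter(items)
    try:
        prev = next(it)
    except StopIteration:
        return  # empty input: no runs
    run = [prev]
    for cur in it:
        if should_break(prev, cur):
            yield run       # run is complete; emit it
            run = [cur]     # start a new run
        else:
            run.append(cur)
        prev = cur
    yield run  # the final (possibly incomplete) run

# Example: break whenever the gap between successive numbers exceeds 1.
runs = list(group_runs([1, 2, 3, 7, 8, 20], lambda a, b: b - a > 1))
# runs == [[1, 2, 3], [7, 8], [20]]
```

In Spark, the last run yielded by each partition would be the "incomplete group" that may need to be merged with the first run of the next partition, which is where the partition-index bookkeeping comes in.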