That's an interesting idea! I hadn't considered that. However, looking at the Partitioner interface, I would need to decide the target partition from a single key, which unfortunately doesn't fit my case: I need to compare successive pairs of keys. (I'm trying to re-join lines that were split prematurely.)
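
To make the problem concrete, here is a rough sketch of the two-pass idea in plain Python, with partitions modeled as ordinary lists so it runs without Spark. All names (group_runs, stitch, ends_statement) are made up for illustration, and the "a line is complete when it ends with ';'" rule is only an assumed example of a break function. The first pass is the kind of function one could hand to mapPartitions; the second pass merges runs that straddle a partition boundary.

```python
from typing import Callable, Iterable, Iterator, List


def group_runs(items: Iterable[str],
               is_break: Callable[[str, str], bool]) -> Iterator[List[str]]:
    """First pass: split one partition into runs of successive items.

    A new run starts whenever is_break(previous, current) is True.
    Only the current run is held in memory, not the whole partition.
    """
    run: List[str] = []
    for cur in items:
        if run and is_break(run[-1], cur):
            yield run
            run = []
        run.append(cur)
    if run:
        yield run


def stitch(partitions: List[List[List[str]]],
           is_break: Callable[[str, str], bool]) -> List[List[str]]:
    """Second pass: walk the partitions in order and merge a run into the
    previous one when there is no break between them.

    Runs inside a partition are already separated by a break, so only runs
    that meet at a partition boundary can actually be merged here. In a real
    job only the first and last run of each partition would need this check;
    collecting everything into one list is a simplification for the sketch.
    """
    out: List[List[str]] = []
    for runs in partitions:
        for run in runs:
            if out and not is_break(out[-1][-1], run[0]):
                out[-1] = out[-1] + run
            else:
                out.append(list(run))
    return out


def ends_statement(prev: str, cur: str) -> bool:
    # Assumed example rule: a line is complete once it ends with ';'.
    return prev.endswith(";")


# Three "partitions" of a file whose lines were split prematurely.
partitions = [["a=1;", "b="], ["2;", "c="], ["3;"]]
per_partition = [list(group_runs(p, ends_statement)) for p in partitions]
rejoined = ["".join(run) for run in stitch(per_partition, ends_statement)]
print(rejoined)  # ['a=1;', 'b=2;', 'c=3;']
```

With an actual RDD, the first pass would presumably look like rdd.mapPartitions(lambda it: group_runs(it, ends_statement)); the stitching step is the part that needs knowledge of partition order.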

On Tue, Jun 30, 2015 at 2:07 PM, Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote:

> Could you use a custom partitioner to preserve boundaries such that all
> related tuples end up on the same partition?
>
> On Jun 30, 2015, at 12:00 PM, RJ Nowling <rnowl...@gmail.com> wrote:
>
> Thanks, Reynold. I still need to handle incomplete groups that fall
> between partition boundaries. So, I need a two-pass approach. I came up
> with a somewhat hacky way to handle those using the partition indices and
> key-value pairs as a second pass after the first.
>
> OCaml's std library provides a function called group() that takes a break
> function that operates on pairs of successive elements. It seems a
> similar approach could be used in Spark and would be more efficient than my
> approach with key-value pairs since you know the ordering of the partitions.
>
> Has this need been expressed by others?
>
> On Tue, Jun 30, 2015 at 1:03 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> Try mapPartitions, which gives you an iterator, and you can produce an
>> iterator back.
>>
>> On Tue, Jun 30, 2015 at 11:01 AM, RJ Nowling <rnowl...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I have a problem where I have an RDD of elements:
>>>
>>> Item1 Item2 Item3 Item4 Item5 Item6 ...
>>>
>>> and I want to run a function over them to decide which runs of elements
>>> to group together:
>>>
>>> [Item1 Item2] [Item3] [Item4 Item5 Item6] ...
>>>
>>> Technically, I could use aggregate to do this, but I would have to use a
>>> List of List of T, which would produce a very large collection in memory.
>>>
>>> Is there an easy way to accomplish this? E.g., it would be nice to
>>> have a version of aggregate where the combination function can return a
>>> complete group that is added to the new RDD and an incomplete group which
>>> is passed to the next call of the reduce function.
>>>
>>> Thanks,
>>> RJ