Hi Koert,

cogroup is a transformation on an RDD: it creates a CoGroupedRDD and then applies some further transformations to it. Nothing is computed at that point. When an action is called later, the compute() method of the CoGroupedRDD is invoked. Roughly speaking, the elements of the CoGroupedRDD are fetched one at a time. Thus the contents of the two iterables do not need to fit in memory.
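To make the "transformation now, work later" idea concrete, here is a rough plain-Python sketch. It is not Spark code and not how CoGroupedRDD is implemented; lazy_cogroup and traced are hypothetical names made up for this illustration, and the sketch assumes both inputs are already sorted by key. It only shows that building the cogrouped sequence pulls nothing from the inputs, and that results are then produced one key at a time once something consumes them.

```python
from itertools import groupby
from operator import itemgetter


def lazy_cogroup(left, right):
    """Yield (key, left_values, right_values), one key at a time.

    Hypothetical helper for illustration only. Assumes `left` and `right`
    are iterables of (key, value) pairs that are already sorted by key.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    l = next(left_groups, None)
    r = next(right_groups, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Key exists only on the left side.
            yield l[0], [v for _, v in l[1]], []
            l = next(left_groups, None)
        elif l is None or r[0] < l[0]:
            # Key exists only on the right side.
            yield r[0], [], [v for _, v in r[1]]
            r = next(right_groups, None)
        else:
            # Same key on both sides: drain both groups before advancing.
            yield l[0], [v for _, v in l[1]], [v for _, v in r[1]]
            l = next(left_groups, None)
            r = next(right_groups, None)


pulled = []  # records every pair actually read from an input


def traced(pairs):
    """Wrap an iterable so that each pair read from it is recorded."""
    for pair in pairs:
        pulled.append(pair)
        yield pair


left = traced([("a", 1), ("a", 2), ("b", 3)])
right = traced([("a", "x"), ("c", "y")])

grouped = lazy_cogroup(left, right)  # like a transformation: no input read yet
assert pulled == []

for key, left_values, right_values in grouped:  # like an action: work happens here
    print(key, left_values, right_values)
```

Note that this sketch collects the values of the current key into lists, so it only demonstrates the lazy, element-at-a-time evaluation, not any particular memory behaviour of Spark itself.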
Hope this helps!
Liq

On Mon, Sep 29, 2014 at 4:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

> apologies for asking yet again about spark memory assumptions, but i cant
> seem to keep it in my head.
>
> if i use PairRDDFunctions.cogroup, it returns for every key 2 iterables.
> do the contents of these iterables have to fit in memory? or is the data
> streamed?

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst