Thanks Matei and Mridul - I was basically wondering whether we would be able
to change the shuffle to accommodate this after 1.0, and from your answers
it sounds like we can.
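For anyone following the thread, here is a minimal, self-contained sketch of the shape Sandy describes. This is not Spark's real code: `fetch` is a hypothetical stand-in for ShuffleFetcher.fetch, and the names are made up for illustration. The point is that the Iterator[P] itself can stream from disk one element at a time, but when P is a (key, values) tuple, all values for one key sit in memory together.

```scala
// A sketch of the concern, not Spark's actual classes. `fetch` is a
// hypothetical stand-in for ShuffleFetcher.fetch, which hands back an
// Iterator[P]. The iterator can be disk-backed (it yields one element at a
// time, like merge-sort output), but with a groupBy each P is a
// (key, values) tuple, so every value for one key is materialized at once.
object GroupByMemorySketch {
  // Streams elements one at a time; the backing store could be disk.
  def fetch[P](elems: Seq[P]): Iterator[P] = elems.iterator

  // Each P is (key, Seq[value]); the whole Seq for a key is in memory
  // before we can even look at its first value.
  def sums: Map[String, Int] = {
    val grouped: Iterator[(String, Seq[Int])] =
      fetch(Seq("a" -> Seq(1, 2, 3), "b" -> Seq(4, 5)))
    grouped.map { case (k, values) => k -> values.sum }.toMap
  }
}
```

By contrast, Hadoop MapReduce hands the reducer an iterable of values that it streams from disk, which is why a single key with a huge value list doesn't blow up the reducer's heap there.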


On Mon, Apr 21, 2014 at 12:31 AM, Mridul Muralidharan <mri...@gmail.com> wrote:

> As Matei mentioned, the values are now an Iterable, which can be
> disk-backed. Does that not address the concern?
>
> @Patrick - we do have cases where the length of the sequence is large
> and the size per value is also non-trivial, so we do need this :-)
> Note that join is a trivial example where this is required (in our
> current implementation).
>
> Regards,
> Mridul
>
> On Mon, Apr 21, 2014 at 6:25 AM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
> > The issue isn't that the Iterator[P] can't be disk-backed.  It's that,
> > with a groupBy, each P is a (Key, Values) tuple, and the entire tuple
> > is read into memory at once.  The ShuffledRDD is agnostic to what goes
> > inside P.
> >
> > On Sun, Apr 20, 2014 at 11:36 AM, Mridul Muralidharan <mri...@gmail.com
> >wrote:
> >
> >> An iterator does not imply the data has to be memory-resident;
> >> think of merge-sort output as a disk-backed iterator.
> >>
> >> Tom is actually planning to work with me on something similar,
> >> hopefully this month or next.
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Sun, Apr 20, 2014 at 11:46 PM, Sandy Ryza <sandy.r...@cloudera.com>
> >> wrote:
> >> > Hey all,
> >> >
> >> > After a shuffle / groupByKey, Hadoop MapReduce allows the values for
> >> > a key to not all fit in memory.  The current ShuffleFetcher.fetch
> >> > API, which doesn't distinguish between keys and values and only
> >> > returns an Iterator[P], seems incompatible with this.
> >> >
> >> > Any thoughts on how we could achieve parity here?
> >> >
> >> > -Sandy
> >>
>