Yes, the problem is that the Java API inadvertently requires an Iterable return value, not an Iterator: https://issues.apache.org/jira/browse/SPARK-3369 I think this can't be fixed until Spark 2.x.
It seems possible to cheat and return a wrapper like the "IteratorIterable" I posted in the JIRA. You can return an Iterator instead this way, and as long as Spark happens to consume it only once, it will work fine. I don't know whether single consumption is guaranteed, but it seems to be the case anecdotally.

On Thu, Oct 2, 2014 at 2:01 AM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> A number of the problems I want to work with generate datasets which are
> too large to hold in memory. This becomes an issue when building a
> FlatMapFunction and also when the data used in combineByKey cannot be held
> in memory.
>
> The following is a simple, if a little silly, example of a
> FlatMapFunction returning maxMultiples multiples of a long. It works well
> for maxMultiples = 1000, but what happens if maxMultiples = 10 billion?
> The issue is that call cannot return a List or any other structure which
> is held in memory. What can it return, or is there another way to do this?
>
> public static class GenerateMultiples implements FlatMapFunction<Long, Long> {
>     private final long maxMultiples;
>
>     public GenerateMultiples(final long maxMultiples) {
>         this.maxMultiples = maxMultiples;
>     }
>
>     public Iterable<Long> call(Long l) {
>         List<Long> holder = new ArrayList<Long>();
>         for (long factor = 1; factor < maxMultiples; factor++) {
>             holder.add(l * factor);
>         }
>         return holder;
>     }
> }

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
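A minimal sketch of the wrapper idea discussed above: an Iterable facade over a single-use Iterator, combined with a lazy Iterator that generates multiples on demand so nothing is held in memory. The class and method names here are illustrative, not the actual code posted in the JIRA, and the caveat applies: this is only safe if the consumer iterates at most once.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Illustrative single-use Iterable wrapper around an Iterator.
// Valid only if the consumer calls iterator() at most once.
class IteratorIterable<T> implements Iterable<T> {
    private final Iterator<T> iterator;

    IteratorIterable(Iterator<T> iterator) {
        this.iterator = iterator;
    }

    @Override
    public Iterator<T> iterator() {
        return iterator;
    }
}

// Lazy iterator over the multiples base * 1, base * 2, ...,
// base * (maxMultiples - 1); no list is materialized.
class MultiplesIterator implements Iterator<Long> {
    private final long base;
    private final long maxMultiples;
    private long factor = 1;

    MultiplesIterator(long base, long maxMultiples) {
        this.base = base;
        this.maxMultiples = maxMultiples;
    }

    @Override
    public boolean hasNext() {
        return factor < maxMultiples;
    }

    @Override
    public Long next() {
        if (!hasNext()) throw new NoSuchElementException();
        return base * factor++;
    }
}

public class LazyMultiples {
    // Shape of a FlatMapFunction-style call() that returns an Iterable
    // backed by a lazy Iterator instead of an in-memory List.
    static Iterable<Long> call(Long l, long maxMultiples) {
        return new IteratorIterable<>(new MultiplesIterator(l, maxMultiples));
    }

    public static void main(String[] args) {
        long sum = 0;
        for (long v : call(3L, 5)) { // yields 3, 6, 9, 12 lazily
            sum += v;
        }
        System.out.println(sum); // prints 30
    }
}
```

Because `call` returns a view over a generator rather than a filled `ArrayList`, the memory footprint stays constant regardless of `maxMultiples`.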