You won't be able to use RDDs (or the SparkContext) inside an RDD
operation. I imagine your immediate problem is that the code you've
elided references 'sc', so 'sc' gets captured by the PairFunction and
has to be serialized along with it, which it can't be.

If you want to play it this way, parallelize across roots in Java
instead. That is, use an ExecutorService on the driver to launch a
bunch of operations on RDDs in parallel. There's no reason you can't
do that, although I suppose there are upper limits to what makes
sense on your cluster; 1000 RDD count()s at once isn't a good idea,
for example. See the sketch below.
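
Something like this rough sketch. createRddForRoot() and the pool
size are just placeholders for whatever you actually do per root, and
it assumes it runs inside a method that declares throws Exception:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: run several count() jobs concurrently from the driver.
// createRddForRoot(sc, root) stands in for however you build an RDD
// per root.
ExecutorService pool = Executors.newFixedThreadPool(8); // bound concurrency
Map<String, Future<Long>> pending = new HashMap<String, Future<Long>>();
for (final String root : roots) {
    pending.put(root, pool.submit(new Callable<Long>() {
        public Long call() throws Exception {
            // each call() submits its own independent Spark job
            return createRddForRoot(sc, root).count();
        }
    }));
}
Map<String, Long> counts = new HashMap<String, Long>();
for (Map.Entry<String, Future<Long>> e : pending.entrySet()) {
    counts.put(e.getKey(), e.getValue().get()); // rethrows any job failure
}
pool.shutdown();

The Callable is plain Java running in the driver JVM, not a Spark
function shipped to executors, so it's fine for it to reference sc.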

It may be the case that you don't really need a bunch of RDDs at all,
but can operate on an RDD of pairs of Strings (roots) and
something-elses, all at once.
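
Something roughly like this, where recordsForRoot() is a made-up
stand-in for however a root expands into its rows:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

// Sketch only: one RDD of (root, record) pairs, counted in one job.
Map<String, Object> res = sc.parallelize(roots)
    .flatMapToPair(new PairFlatMapFunction<String, String, String>() {
        public Iterable<Tuple2<String, String>> call(String root) {
            // plain Java here; no use of sc inside the function
            List<Tuple2<String, String>> out =
                new ArrayList<Tuple2<String, String>>();
            for (String record : recordsForRoot(root)) {
                out.add(new Tuple2<String, String>(root, record));
            }
            return out;
        }
    })
    .countByKey(); // rows per root, all in one pass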


On Mon, Aug 18, 2014 at 2:31 PM, David Tinker <david.tin...@gmail.com> wrote:
> Hi All.
>
> I need to create a lot of RDDs starting from a set of "roots" and count the
> rows in each. Something like this:
>
> final JavaSparkContext sc = new JavaSparkContext(conf);
> List<String> roots = ...
> Map<String, Object> res = sc.parallelize(roots)
>     .mapToPair(new PairFunction<String, String, Long>() {
>         public Tuple2<String, Long> call(String root) throws Exception {
>             ... create RDD based on root from sc somehow ...
>             return new Tuple2<String, Long>(root, rdd.count());
>         }
>     })
>     .countByKey();
>
> This fails with a message about JavaSparkContext not being serializable.
>
> Is there a way to get at the context inside of the map function or should I
> be doing something else entirely?
>
> Thanks
> David
