Hmm, I thought as much. I am using Cassandra with the Spark connector. What I really need is an RDD created from a query against Cassandra of the form "where partition_key = :id", where :id is taken from a list. Some grouping of the ids would be a way to partition this. A sketch of what I mean is below.
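Roughly, something like this, written against the DataStax connector's Java API (a rough sketch only: the keyspace, table, and column names are placeholders, and I haven't verified the exact signatures):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import com.datastax.spark.connector.japi.CassandraRow;

    // Sketch: query one Cassandra partition per id from the driver and count it.
    // "ks", "tbl" and "partition_key" are placeholder names.
    public class PerIdCounts {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("per-id-counts");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<String> ids = Arrays.asList("id1", "id2", "id3"); // illustrative ids

            for (String id : ids) {
                // Each where() clause restricts the scan to a single partition key.
                JavaRDD<CassandraRow> rows = javaFunctions(sc)
                        .cassandraTable("ks", "tbl")
                        .where("partition_key = ?", id);
                System.out.println(id + " -> " + rows.count());
            }
            sc.stop();
        }
    }

The counts here run one after another on the driver; grouping the ids, or submitting the counts from a thread pool, would be the way to get parallelism across them.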
On Mon, Aug 18, 2014 at 3:42 PM, Sean Owen <so...@cloudera.com> wrote:
> You won't be able to use RDDs inside of RDD operations. I imagine your
> immediate problem is that the code you've elided references 'sc', and
> that gets referenced by the PairFunction and serialized, but it can't
> be.
>
> If you want to play it this way, parallelize across roots in Java.
> That is, just use an ExecutorService to launch a bunch of operations on
> RDDs in parallel. There's no reason you can't do that, although I
> suppose there are upper limits as to what makes sense on your cluster.
> 1000 RDD count()s at once isn't a good idea, for example.
>
> It may be the case that you don't really need a bunch of RDDs at all,
> but can operate on an RDD of pairs of Strings (roots) and
> something-elses, all at once.
>
>
> On Mon, Aug 18, 2014 at 2:31 PM, David Tinker <david.tin...@gmail.com> wrote:
> > Hi All.
> >
> > I need to create a lot of RDDs starting from a set of "roots" and count
> > the rows in each. Something like this:
> >
> > final JavaSparkContext sc = new JavaSparkContext(conf);
> > List<String> roots = ...
> > Map<String, Object> res = sc.parallelize(roots).mapToPair(new
> >         PairFunction<String, String, Long>() {
> >     public Tuple2<String, Long> call(String root) throws Exception {
> >         ... create RDD based on root from sc somehow ...
> >         return new Tuple2<String, Long>(root, rdd.count());
> >     }
> > }).countByKey();
> >
> > This fails with a message about JavaSparkContext not being serializable.
> >
> > Is there a way to get at the context inside the map function, or
> > should I be doing something else entirely?
> >
> > Thanks
> > David

--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
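A minimal sketch of the ExecutorService approach Sean describes, submitting the per-root counts from the driver rather than inside an RDD operation (the per-root RDD construction is application-specific and elided in the thread, so buildRddForRoot below is a hypothetical placeholder):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch: launch a bounded number of count() jobs in parallel from the
    // driver. SparkContext job submission is thread-safe, so each task can
    // build and count its own RDD; the pool size keeps the job count sane.
    public class ParallelCounts {
        static Map<String, Long> countRoots(final JavaSparkContext sc, List<String> roots)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(8); // bounded, not 1000 at once
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            for (final String root : roots) {
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        JavaRDD<String> rdd = buildRddForRoot(sc, root); // hypothetical helper
                        return rdd.count();
                    }
                }));
            }
            Map<String, Long> res = new HashMap<String, Long>();
            for (int i = 0; i < roots.size(); i++) {
                res.put(roots.get(i), futures.get(i).get()); // blocks until that count finishes
            }
            pool.shutdown();
            return res;
        }

        // Placeholder: how each per-root RDD is created is elided in the thread.
        static JavaRDD<String> buildRddForRoot(JavaSparkContext sc, String root) {
            throw new UnsupportedOperationException("application-specific");
        }
    }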