Hmm, I thought as much. I am using Cassandra with the Spark connector. What I really need is an RDD created from a query against Cassandra of the form "where partition_key = :id", where :id is taken from a list. Some grouping of the ids would be a way to partition this. A sketch of what I mean is below.
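Roughly, something like this, written against the DataStax connector's Java API (a rough sketch only: the keyspace, table, and column names are placeholders, and I haven't verified the exact signatures):

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import com.datastax.spark.connector.japi.CassandraRow;

    // Sketch: query one Cassandra partition per id from the driver and count it.
    // "ks", "tbl" and "partition_key" are placeholder names.
    public class PerIdCounts {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("per-id-counts");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<String> ids = Arrays.asList("id1", "id2", "id3"); // illustrative ids

            for (String id : ids) {
                // Each where() clause restricts the scan to a single partition key.
                JavaRDD<CassandraRow> rows = javaFunctions(sc)
                        .cassandraTable("ks", "tbl")
                        .where("partition_key = ?", id);
                System.out.println(id + " -> " + rows.count());
            }
            sc.stop();
        }
    }

The counts here run one after another on the driver; grouping the ids, or submitting the counts from a thread pool, would be the way to get parallelism across them.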
On Mon, Aug 18, 2014 at 3:42 PM, Sean Owen <so...@cloudera.com> wrote:
> You won't be able to use RDDs inside of RDD operations. I imagine your
> immediate problem is that the code you've elided references 'sc', and
> that gets referenced by the PairFunction and serialized, but it can't
> be.
>
> If you want to play it this way, parallelize across roots in Java.
> That is, just use an ExecutorService to launch a bunch of operations on
> RDDs in parallel. There's no reason you can't do that, although I
> suppose there are upper limits as to what makes sense on your cluster.
> 1000 RDD count()s at once isn't a good idea, for example.
>
> It may be the case that you don't really need a bunch of RDDs at all,
> but can operate on an RDD of pairs of Strings (roots) and
> something-elses, all at once.
>
>
> On Mon, Aug 18, 2014 at 2:31 PM, David Tinker <david.tin...@gmail.com> wrote:
> > Hi All.
> >
> > I need to create a lot of RDDs starting from a set of "roots" and count
> > the rows in each. Something like this:
> >
> > final JavaSparkContext sc = new JavaSparkContext(conf);
> > List<String> roots = ...
> > Map<String, Object> res = sc.parallelize(roots).mapToPair(new
> >         PairFunction<String, String, Long>() {
> >     public Tuple2<String, Long> call(String root) throws Exception {
> >         ... create RDD based on root from sc somehow ...
> >         return new Tuple2<String, Long>(root, rdd.count());
> >     }
> > }).countByKey();
> >
> > This fails with a message about JavaSparkContext not being serializable.
> >
> > Is there a way to get at the context inside the map function, or
> > should I be doing something else entirely?
> >
> > Thanks
> > David

--
http://qdb.io/ Persistent Message Queues With Replay and #RabbitMQ Integration
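A minimal sketch of the ExecutorService approach Sean describes, submitting the per-root counts from the driver rather than inside an RDD operation (the per-root RDD construction is application-specific and elided in the thread, so buildRddForRoot below is a hypothetical placeholder):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch: launch a bounded number of count() jobs in parallel from the
    // driver. SparkContext job submission is thread-safe, so each task can
    // build and count its own RDD; the pool size keeps the job count sane.
    public class ParallelCounts {
        static Map<String, Long> countRoots(final JavaSparkContext sc, List<String> roots)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(8); // bounded, not 1000 at once
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            for (final String root : roots) {
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        JavaRDD<String> rdd = buildRddForRoot(sc, root); // hypothetical helper
                        return rdd.count();
                    }
                }));
            }
            Map<String, Long> res = new HashMap<String, Long>();
            for (int i = 0; i < roots.size(); i++) {
                res.put(roots.get(i), futures.get(i).get()); // blocks until that count finishes
            }
            pool.shutdown();
            return res;
        }

        // Placeholder: how each per-root RDD is created is elided in the thread.
        static JavaRDD<String> buildRddForRoot(JavaSparkContext sc, String root) {
            throw new UnsupportedOperationException("application-specific");
        }
    }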