Another approach could be to create artificial keys for each RDD and convert them to PairRDDs. Your first RDD becomes a JavaPairRDD<Integer, String> with values (1, "1"), (1, "2"), and so on, and the second becomes (2, "a"), (2, "b"), (2, "c"). You can then union the two RDDs and use groupByKey, countByKey, etc. to maybe achieve what you are trying to do.
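Here is a rough, untested sketch of that idea. It reuses the sc and the two datasets from your example, and assumes the usual imports (scala.Tuple2, org.apache.spark.api.java.JavaPairRDD, org.apache.spark.api.java.function.PairFunction):

JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("1", "2", "3"));
JavaRDD<String> rdd2 = sc.parallelize(Arrays.asList("a", "b", "c"));

// Tag every element with an artificial key identifying its source RDD.
JavaPairRDD<Integer, String> pairs1 = rdd1.mapToPair(
    new PairFunction<String, Integer, String>() {
      @Override
      public Tuple2<Integer, String> call(String s) {
        return new Tuple2<Integer, String>(1, s);
      }
    });
JavaPairRDD<Integer, String> pairs2 = rdd2.mapToPair(
    new PairFunction<String, Integer, String>() {
      @Override
      public Tuple2<Integer, String> call(String s) {
        return new Tuple2<Integer, String>(2, s);
      }
    });

// A single RDD holding both datasets, distinguishable by key.
JavaPairRDD<Integer, String> all = pairs1.union(pairs2);

// Per-source element counts, e.g. {1=3, 2=3}.
System.out.println(all.countByKey());

// Or group each source's values together for further processing.
JavaPairRDD<Integer, Iterable<String>> grouped = all.groupByKey();

That said, if all you need is to run an action such as count() on each RDD, a plain loop over a List<JavaRDD<String>> in your driver code already does it:

List<JavaRDD<String>> rdds = Arrays.asList(rdd1, rdd2);
for (JavaRDD<String> r : rdds) {
    System.out.println(r.count()); // each action runs from the driver, one after another
}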
Sorry, this is just a hypothesis, as I am not entirely sure about what you are trying to achieve. Ideally, I would think hard about whether multiple RDDs are indeed needed, just as Sean pointed out.

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Wed, Oct 22, 2014 at 8:35 PM, Sean Owen <so...@cloudera.com> wrote:

> No, there's no such thing as an RDD of RDDs in Spark.
> Here though, why not just operate on an RDD of Lists? Or a List of RDDs?
> Usually one of these two is the right approach whenever you feel
> inclined to operate on an RDD of RDDs.
>
> On Wed, Oct 22, 2014 at 3:58 PM, Tomer Benyamini <tomer....@gmail.com>
> wrote:
> > Hello,
> >
> > I would like to parallelize my work on multiple RDDs I have. I wanted
> > to know if Spark can support a "foreach" on an RDD of RDDs. Here's a
> > Java example:
> >
> > public static void main(String[] args) {
> >
> >     SparkConf sparkConf = new SparkConf().setAppName("testapp");
> >     sparkConf.setMaster("local");
> >
> >     JavaSparkContext sc = new JavaSparkContext(sparkConf);
> >
> >     List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
> >     JavaRDD<String> rdd = sc.parallelize(list);
> >
> >     List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
> >     JavaRDD<String> rdd1 = sc.parallelize(list1);
> >
> >     List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
> >     rddList.add(rdd);
> >     rddList.add(rdd1);
> >
> >     JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
> >     System.out.println(rddOfRdds.count());
> >
> >     rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {
> >
> >         @Override
> >         public void call(JavaRDD<String> t) throws Exception {
> >             System.out.println(t.count());
> >         }
> >     });
> > }
> >
> > Running this code, I get a NullPointerException from the inner count()
> > call:
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job
> > aborted due to stage failure: Task 1.0:0 failed 1 times, most recent
> > failure: Exception failure in TID 1 on host localhost:
> > java.lang.NullPointerException
> >
> > org.apache.spark.rdd.RDD.count(RDD.scala:861)
> > org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365)
> > org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)
> >
> > Help will be appreciated.
> >
> > Thanks,
> > Tomer