Dear All,
I need to iterate some job / RDD quite a lot of times, but I ran into the
problem that Spark only accepts around 350 chained map calls before one
action function; besides, calling dozens of actions will obviously increase
the run time. Is there any proper way ...
As a test, here is the piece of code:
......
int count = 0;
JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache(); // with only 1 partition
int m = 350;
JavaRDD<Integer> r = dataSet.cache();
JavaRDD<Integer> t = null;

// outer loop to temporarily convert the RDD r to t
for (int j = 0; j < m; ++j) {
    if (null != t) {
        r = t;
    }
    // inner loop to call map 350 times; if m is much more than 350
    // (for instance, around 400), then the job throws the exception:
    //   15/12/21 19:36:17 ERROR yarn.ApplicationMaster: User class threw
    //   exception: java.lang.StackOverflowError
    for (int i = 0; i < m; ++i) {
        r = r.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        });
    }

    // then collect this RDD to get another RDD; however, dozens of
    // action functions such as collect are VERY COSTLY
    List<Integer> lt = r.collect();
    t = jsc.parallelize(lt, 1).cache();
}
......
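As a side note, I believe the limit comes from the JVM stack rather than from Spark itself: nesting enough wrapped function calls overflows the stack in plain Java too. Here is a minimal, Spark-free sketch of that mechanism (the class name DeepCompose and the depth of 1,000,000 are my own illustration, not taken from the job above):

```java
import java.util.function.Function;

public class DeepCompose {
    // Returns true if applying a chain of `depth` composed functions
    // overflows the JVM stack, false if it completes normally.
    static boolean overflows(int depth) {
        Function<Integer, Integer> f = x -> x;
        for (int i = 0; i < depth; i++) {
            // Each andThen wraps the previous function, so apply() later
            // needs one extra stack frame per layer -- much like each map()
            // adds one more level to the RDD lineage.
            f = f.andThen(x -> x);
        }
        try {
            f.apply(0);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("depth 10 overflows: " + overflows(10));
        System.out.println("depth 1000000 overflows: " + overflows(1_000_000));
    }
}
```

This is only an analogy, since in Spark the error shows up while the very deep lineage is processed, but it suggests why collect() followed by parallelize() helps: the freshly parallelized RDD starts with an empty lineage again.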
Thanks very much in advance!

Zhiliang