In what situations do you run into such cases? If there is no shuffle, you can 
collapse all these map functions into one, right? In the meantime, it is not 
recommended to collect all the data to the driver.
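
For instance, here is a minimal sketch of what "collapse into one" could look like, assuming the per-element logic and the variable names (jsc, list, m) from your quoted code below. Since each map is independent of the others and there is no shuffle between them, the m * m chained map() calls can be folded into a single map(), so the lineage stays one transformation deep:

    final int m = 350; // final so the anonymous Function can capture it
    JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache();
    JavaRDD<Integer> r = dataSet.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        Integer v = integer;
        // Do the work of m * m chained map() calls in one pass, so the
        // RDD lineage never grows long enough to overflow the stack.
        for (int k = 0; k < m * m; ++k) {
          double x = Math.random() * 2 - 1;
          double y = Math.random() * 2 - 1;
          v = (x * x + y * y < 1) ? 1 : 0;
        }
        return v;
      }
    });

If you do need to truncate a long lineage, Spark's built-in way is rdd.checkpoint() together with jsc.setCheckpointDir(...), which writes the RDD to reliable storage without shipping all the data back through the driver the way collect() plus parallelize() does.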

Thanks.

Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

Dear All,

I need to iterate over a job / RDD quite a lot of times, but I ran into the 
problem that Spark only accepts around 350 chained map calls before it needs 
one action Function; on the other hand, dozens of actions obviously increase 
the run time.
Is there any proper way ...

Here is the piece of code I tested:

......
int count = 0;
JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache(); // with only 1 partition
int m = 350;
JavaRDD<Integer> r = dataSet.cache();
JavaRDD<Integer> t = null;

// Outer loop: temporarily convert the RDD r to t.
for (int j = 0; j < m; ++j) {
  if (t != null) {
    r = t;
  }

  // Inner loop: chain map() 350 times. If m is much more than 350
  // (for instance, around 400), the job throws:
  //   15/12/21 19:36:17 ERROR yarn.ApplicationMaster: User class threw
  //   exception: java.lang.StackOverflowError
  for (int i = 0; i < m; ++i) {
    r = r.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
      }
    });
  }

  // Collect this RDD and re-parallelize it to get another RDD;
  // however, dozens of actions such as collect() are VERY costly.
  List<Integer> lt = r.collect();
  t = jsc.parallelize(lt, 1).cache();
}
......

Thanks very much in advance!
Zhiliang

