Yes, it's surprising to me as well... I tried to execute it with different configurations:
sudo -u hdfs spark-submit --master yarn-client --class com.mycompany.app.App \
  --num-executors 40 --executor-memory 4g Example-1.0-SNAPSHOT.jar \
  hdfs://ip:8020/tmp/sparkTest/ file22.bin parameters

This is what I executed, with different values for num-executors and
executor-memory. Do you think there are too many executors for those HDDs?
Could that be the reason each executor takes more time?

2015-02-06 9:36 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> That's definitely surprising to me that you would be hitting a lot of GC
> for this scenario. Are you setting --executor-cores and --executor-memory?
> What are you setting them to?
>
> -Sandy
>
> On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>>
>> Any idea why, if I use more containers, I get a lot of tasks stopped
>> because of GC?
>>
>> 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:
>> > I'm not caching the data. By "each iteration" I mean each 128 MB
>> > block that an executor has to process.
>> >
>> > The code is pretty simple.
>> >
>> > final Conversor c = new Conversor(null, null, null,
>> >     longFields, typeFields);
>> > SparkConf conf = new SparkConf().setAppName("Simple Application");
>> > JavaSparkContext sc = new JavaSparkContext(conf);
>> > JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());
>> >
>> > JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
>> >     @Override
>> >     public String call(byte[] arg0) throws Exception {
>> >         return c.parse(arg0).toString();
>> >     }
>> > });
>> > rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis()
>> >     + "/");
>> >
>> > The parse function just takes an array of bytes and applies some
>> > transformations like: [0..3] an integer, [4..20] a String, [21..27]
>> > another String, and so on.
>> >
>> > It's just test code; I'd like to understand what is happening.
>> >
>> > 2015-02-04 18:57 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
>> >> Hi Guillermo,
>> >>
>> >> What exactly do you mean by "each iteration"? Are you caching data
>> >> in memory?
>> >>
>> >> -Sandy
>> >>
>> >> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com>
>> >> wrote:
>> >>>
>> >>> I execute a job in Spark where I'm processing an 80 GB file in HDFS.
>> >>> I have 5 slaves:
>> >>> (32 cores / 256 GB / 7 physical disks) x 5
>> >>>
>> >>> I have been trying many different configurations with YARN:
>> >>> yarn.nodemanager.resource.memory-mb   196 GB
>> >>> yarn.nodemanager.resource.cpu-vcores  24
>> >>>
>> >>> I have tried to execute the job with different numbers of executors
>> >>> and memory (1-4 GB).
>> >>> With 20 executors, each iteration (128 MB) takes 25 s, and it never
>> >>> spends a really long time waiting on GC.
>> >>>
>> >>> When I execute with around 60 executors, the processing time is
>> >>> about 45 s, and some tasks take up to one minute because of GC.
>> >>>
>> >>> I have no idea why GC kicks in when I execute more executors
>> >>> simultaneously.
>> >>> The other question is why it takes more time to process each block.
>> >>> My theory is that there are only 7 physical disks, and 20 processes
>> >>> writing is not the same as 5.
>> >>>
>> >>> The code is pretty simple; it's just a map function which parses a
>> >>> line and writes the output to HDFS. There are a lot of substrings
>> >>> inside the function, which could cause GC.
>> >>>
>> >>> Any theories?
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >>
>> >
>
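[Editor's note: since the thread centers on substring-generated garbage in the parse step, here is a minimal sketch of a fixed-width record parser. It is not the original Conversor; the class name, field offsets ([0..3] int, [4..20] String, [21..27] String, as described in the thread), and tab-separated output are illustrative assumptions. Decoding each field directly from the byte array avoids materializing the whole record as a String and then calling substring() per field, which is one common source of the short-lived garbage discussed above.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical fixed-width record parser, mirroring the layout from the
// thread: an int in bytes [0..3], a String in [4..20], another in [21..27].
public class FixedWidthParser {

    public static String parse(byte[] record) {
        // Read the integer directly from the raw bytes.
        int id = ByteBuffer.wrap(record).getInt(0);          // bytes [0..3]
        String f1 = field(record, 4, 21);                    // bytes [4..20]
        String f2 = field(record, 21, 28);                   // bytes [21..27]

        // Build the output line once, instead of concatenating substrings.
        StringBuilder sb = new StringBuilder(64);
        sb.append(id).append('\t').append(f1).append('\t').append(f2);
        return sb.toString();
    }

    // Decode one field straight from the array slice; no intermediate
    // String of the entire record, no substring() copies.
    private static String field(byte[] bytes, int from, int to) {
        return new String(bytes, from, to - from, StandardCharsets.US_ASCII)
                .trim(); // trim() also strips the zero-byte padding
    }

    public static void main(String[] args) {
        // Assemble a sample 28-byte record for demonstration.
        byte[] rec = new byte[28];
        ByteBuffer.wrap(rec).putInt(0, 42);
        byte[] name = "HELLO".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(name, 0, rec, 4, name.length);
        byte[] tag = "XY".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(tag, 0, rec, 21, tag.length);

        System.out.println(parse(rec)); // prints the three fields, tab-separated
    }
}
```

A parser like this would drop into the `rdd.map(...)` call shown earlier in the thread; whether it actually reduces the GC pauses observed at 60 executors would need to be confirmed with the executor GC logs.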