Re: Problems with GC and time to execute with different number of executors.
That's definitely surprising to me that you would be hitting a lot of GC for this scenario. Are you setting --executor-cores and --executor-memory? What are you setting them to?

-Sandy

On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
> Any idea why, when I use more containers, I get a lot of tasks stopped because of GC?
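For illustration, an invocation that sets both flags might look like the following; it reuses the class and jar names that appear later in the thread, and every numeric value here is a placeholder rather than a recommendation:

    spark-submit --master yarn-client --class com.mycompany.app.App \
      --num-executors 20 --executor-cores 2 --executor-memory 4g \
      Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin parameters

If --executor-cores is not set, Spark on YARN defaults to a single core per executor, and the number of simultaneously running tasks is num-executors x executor-cores.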
Re: Problems with GC and time to execute with different number of executors.
This is an execution with 80 executors:

    Metric     Min       25th percentile  Median    75th percentile  Max
    Duration   31s       44s              50s       1.1 min          2.6 min
    GC Time    70ms      0.1s             0.3s      4s               53s
    Input      128.0 MB  128.0 MB         128.0 MB  128.0 MB         128.0 MB

I executed it with 40 executors as well:

    Metric     Min       25th percentile  Median    75th percentile  Max
    Duration   26s       28s              28s       30s              35s
    GC Time    54ms      60ms             66ms      80ms             0.4s
    Input      128.0 MB  128.0 MB         128.0 MB  128.0 MB         128.0 MB

I checked %iowait and %steal on a worker and both look fine in each run.

I understand the value of yarn.nodemanager.resource.memory-mb is per worker in the cluster, not the total for YARN; it's configured at 196 GB right now (I have 5 workers). 80 executors x 4 GB = 320 GB, so it shouldn't be a problem.

2015-02-06 10:03 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
> What percent of task time does the web UI report that tasks are spending in GC?
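Spelling out the arithmetic on those numbers (assuming the slowest task is also the one with 53s of GC, which the tables above don't guarantee):

    80 executors, worst task:   53s GC / 156s (2.6 min) duration  ~ 34% of task time in GC
    80 executors, median task:  0.3s GC / 50s duration            =  0.6%
    40 executors, worst task:   0.4s GC / 35s duration            ~  1%

On the memory side, each executor on YARN also requests an off-heap overhead on top of the heap (spark.yarn.executor.memoryOverhead; in Spark 1.x the default is roughly max(384 MB, 7% of executor memory), if the defaults haven't changed), so:

    80 executors x (4 GB heap + ~0.4 GB overhead)  ~ 350 GB requested
    5 nodes x 196 GB (yarn.nodemanager.resource.memory-mb) = 980 GB available

Total memory capacity indeed shouldn't be the constraint; the anomaly is the 75th-percentile and max GC times with 80 executors.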
Re: Problems with GC and time to execute with different number of executors.
Yes, it's surprising to me as well. I tried to execute it with different configurations:

    sudo -u hdfs spark-submit --master yarn-client --class com.mycompany.app.App \
      --num-executors 40 --executor-memory 4g \
      Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin parameters

This is what I executed, with different values for num-executors and executor-memory. Do you think there are too many executors for those HDDs? Could that be the reason each executor takes more time?

2015-02-06 9:36 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
> Are you setting --executor-cores and --executor-memory? What are you setting them to?
Re: Problems with GC and time to execute with different number of executors.
Yes, having many more cores than disks and all writing at the same time can definitely cause performance issues. Though that wouldn't explain the high GC. What percent of task time does the web UI report that tasks are spending in GC?

On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
> Do you think there are too many executors for those HDDs? Could that be the reason each executor takes more time?
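One way to pin that down (a suggestion using standard Spark/JVM options, not something taken from the thread): enable GC logging in the executor JVMs and compare it against the web UI's GC Time column, e.g.

    spark-submit --master yarn-client --class com.mycompany.app.App \
      --num-executors 40 --executor-memory 4g \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin parameters

The GC lines end up in each executor's stdout, which is reachable through the YARN container logs or the executor links in the Spark web UI.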
Re: Problems with GC and time to execute with different number of executors.
I'm not caching the data. By "each iteration" I mean each 128 MB block that an executor has to process. The code is pretty simple:

    final Conversor c = new Conversor(null, null, null, longFields, typeFields);
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);
    JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());
    JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
        @Override
        public String call(byte[] arg0) throws Exception {
            String result = c.parse(arg0).toString();
            return result;
        }
    });
    rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some transformations like: bytes [0..3] become an integer, [4..20] a String, [21..27] another String, and so on. It's just test code; I'd like to understand what's happening.

2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
> Hi Guillermo,
> What exactly do you mean by each iteration? Are you caching data in memory?
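The Conversor class itself never appears in the thread; a minimal sketch of what a fixed-offset parse like the one described might look like (record layout, field names, and charset are all assumptions, and the five-argument constructor is omitted):

    import java.io.Serializable;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    // Hypothetical stand-in for Conversor: decodes one fixed-width binary
    // record into a comma-separated line, following the offsets described above.
    public class Conversor implements Serializable {
        private static final int RECORD_LENGTH = 28;  // assumed total record size

        // Record length handed to sc.binaryRecords(path, recordLength).
        public int calculaLongBlock() {
            return RECORD_LENGTH;
        }

        public String parse(byte[] record) {
            int id = ByteBuffer.wrap(record).getInt();                      // bytes [0..3]
            String f1 = new String(record, 4, 17, StandardCharsets.UTF_8);  // bytes [4..20]
            String f2 = new String(record, 21, 7, StandardCharsets.UTF_8);  // bytes [21..27]
            return id + "," + f1 + "," + f2;
        }
    }

Note that each parse call allocates a fresh String per field; at 128 MB per task that is on the order of millions of small objects, which matches the young-generation churn discussed in this thread.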
Re: Problems with GC and time to execute with different number of executors.
Any idea why, when I use more containers, I get a lot of tasks stopped because of GC?

2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
> I'm not caching the data. By "each iteration" I mean each 128 MB block that an executor has to process.
Problems with GC and time to execute with different number of executors.
I execute a job in Spark that processes an 80 GB file in HDFS. I have 5 slaves: (32 cores / 256 GB RAM / 7 physical disks) x 5.

I have been trying many different configurations with YARN:

    yarn.nodemanager.resource.memory-mb   196 GB
    yarn.nodemanager.resource.cpu-vcores  24

I have tried to execute the job with different numbers of executors and executor memory (1-4 GB). With 20 executors, each iteration (128 MB) takes 25s and tasks never spend a really long time waiting on GC. With around 60 executors, processing time is about 45s and some tasks take up to a minute because of GC. I have no idea why GC kicks in when I run more executors simultaneously.

The other question is why each block takes more time to execute. My theory is that there are only 7 physical disks per node, and 20 processes writing at the same time is not the same as 5.

The code is pretty simple: just a map function that parses a line and writes the output to HDFS. There are a lot of substring operations inside the function, which could cause GC. Any theories?
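One detail worth noting about the substrings (an illustrative fragment, not the actual job code): since JDK 7u6, String.substring copies the underlying characters instead of sharing the parent's array, so slicing several fixed-width fields out of every record produces a steady stream of short-lived objects:

    // Illustrative only: each substring call allocates a new String and,
    // on JDK 7u6 and later, copies its characters, so every record parsed
    // leaves several short-lived objects for the young-generation GC.
    String line   = new String(record, StandardCharsets.UTF_8);
    String id     = line.substring(0, 4);
    String field1 = line.substring(4, 21);
    String field2 = line.substring(21, 28);

Allocation churn like this is usually cheap for a single JVM, but when many executor JVMs share a node's cores, their GC threads compete for CPU, which could stretch individual collections; that is one plausible reading of why the pauses grow with more executors.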