Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
That's definitely surprising to me that you would be hitting a lot of GC
for this scenario.  Are you setting --executor-cores and
--executor-memory?  What are you setting them to?
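For reference, a submit command that sets both flags explicitly would look
something like this (the numbers here are only placeholders, not a
recommendation):

  spark-submit --master yarn-client \
    --class com.mycompany.app.App \
    --num-executors 20 \
    --executor-cores 4 \
    --executor-memory 4g \
    Example-1.0-SNAPSHOT.jar <input path> <args>

If --executor-cores isn't set it defaults to 1 on YARN, so it's worth checking
what each executor is actually getting.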

-Sandy

On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:

 Any idea why I get so many GC pauses when I use more containers?

 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple:

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        String result = c.parse(arg0).toString();
        return result;
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations like: [0..3] an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.
 
  2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
  Hi Guillermo,
 
  What exactly do you mean by each iteration?  Are you caching data in
  memory?
 
  -Sandy
 
  On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz konstt2...@gmail.com
  wrote:
 
I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
This is an execution with 80 executors

Metric      Min       25th pct   Median    75th pct   Max
Duration    31 s      44 s       50 s      1.1 min    2.6 min
GC Time     70 ms     0.1 s      0.3 s     4 s        53 s
Input       128.0 MB  128.0 MB   128.0 MB  128.0 MB   128.0 MB

I also executed it with 40 executors:

Metric      Min       25th pct   Median    75th pct   Max
Duration    26 s      28 s       28 s      30 s       35 s
GC Time     54 ms     60 ms      66 ms     80 ms      0.4 s
Input       128.0 MB  128.0 MB   128.0 MB  128.0 MB   128.0 MB

I checked %iowait and %steal on a worker and they look fine in both cases.
I understand the value of yarn.nodemanager.resource.memory-mb is per
worker in the cluster, not the total for YARN; it's configured to 196 GB
right now (I have 5 workers).
80 executors x 4 GB = 320 GB, so it shouldn't be a problem.
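As a rough sanity check on that (assuming the default YARN memory overhead of
roughly 7-10% of executor memory, with a 384 MB floor, so call it ~4.4 GB per
container):
  80 executors / 5 workers = 16 executors per worker
  16 x ~4.4 GB = about 70 GB per NodeManager, well below the 196 GB configured.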


2015-02-06 10:03 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
 Yes, having many more cores than disks and all writing at the same time can
 definitely cause performance issues.  Though that wouldn't explain the high
 GC.  What percent of task time does the web UI report that tasks are
 spending in GC?

 On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz konstt2...@gmail.com
 wrote:

Yes, it's surprising to me as well.

I tried to execute it with different configurations:

sudo -u hdfs spark-submit  --master yarn-client --class
com.mycompany.app.App --num-executors 40 --executor-memory 4g
Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin
parameters

This is what I executed, with different values for num-executors and
executor-memory.
Do you think there are too many executors for those HDDs? Could that be
the reason each executor takes more time?

 2015-02-06 9:36 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
  That's definitely surprising to me that you would be hitting a lot of GC
  for
  this scenario.  Are you setting --executor-cores and --executor-memory?
  What are you setting them to?
 
  -Sandy
 
  On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com
  wrote:
 
Any idea why I get so many GC pauses when I use more containers?
 
  2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple:

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        String result = c.parse(arg0).toString();
        return result;
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations like: [0..3] an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.
  
   2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
   Hi Guillermo,
  
   What exactly do you mean by each iteration?  Are you caching data
   in
   memory?
  
   -Sandy
  
   On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz
   konstt2...@gmail.com
   wrote:
  
I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?
  
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
  
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
Yes, it's surprising to me as well.

I tried to execute it with different configurations:

sudo -u hdfs spark-submit  --master yarn-client --class
com.mycompany.app.App --num-executors 40 --executor-memory 4g
Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin
parameters

This is what I executed, with different values for num-executors and
executor-memory.
Do you think there are too many executors for those HDDs? Could that be
the reason each executor takes more time?

2015-02-06 9:36 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
 That's definitely surprising to me that you would be hitting a lot of GC for
 this scenario.  Are you setting --executor-cores and --executor-memory?
 What are you setting them to?

 -Sandy

 On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com
 wrote:

 Any idea why I get so many GC pauses when I use more containers?

 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple:

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        String result = c.parse(arg0).toString();
        return result;
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations like: [0..3] an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.
 
  2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
  Hi Guillermo,
 
  What exactly do you mean by each iteration?  Are you caching data in
  memory?
 
  -Sandy
 
  On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz konstt2...@gmail.com
  wrote:
 
I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Sandy Ryza
Yes, having many more cores than disks and all writing at the same time can
definitely cause performance issues.  Though that wouldn't explain the high
GC.  What percent of task time does the web UI report that tasks are
spending in GC?
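If the UI numbers look off, another way to see it (just a suggestion; the flags
below are the standard JVM GC-logging options) is to enable GC logging on the
executors and read the container stdout logs:

  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

That shows whether the pauses are frequent young-generation collections or
full GCs.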

On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz konstt2...@gmail.com
wrote:

Yes, it's surprising to me as well.

I tried to execute it with different configurations:

sudo -u hdfs spark-submit  --master yarn-client --class
com.mycompany.app.App --num-executors 40 --executor-memory 4g
Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin
parameters

This is what I executed, with different values for num-executors and
executor-memory.
Do you think there are too many executors for those HDDs? Could that be
the reason each executor takes more time?

 2015-02-06 9:36 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
  That's definitely surprising to me that you would be hitting a lot of GC
 for
  this scenario.  Are you setting --executor-cores and --executor-memory?
  What are you setting them to?
 
  -Sandy
 
  On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz konstt2...@gmail.com
  wrote:
 
Any idea why I get so many GC pauses when I use more containers?
 
  2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple:

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        String result = c.parse(arg0).toString();
        return result;
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations like: [0..3] an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.
  
   2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
   Hi Guillermo,
  
   What exactly do you mean by each iteration?  Are you caching data
 in
   memory?
  
   -Sandy
  
   On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz 
 konstt2...@gmail.com
   wrote:
  
I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?
  
  
 -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
  
 
 



Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple:

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        String result = c.parse(arg0).toString();
        return result;
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations like: [0..3] an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.
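For what it's worth, the parse step is roughly equivalent to the sketch below.
Conversor itself isn't shown in this thread, so the field names and layout here
are only an illustration of the kind of work it does, not the real class:

  // Hypothetical sketch of the per-record parsing. Each String field is a new
  // String object and the final toString() builds yet another one via a
  // StringBuilder, so every 128 MB block produces many short-lived objects.
  public String parse(byte[] record) {
      int id = java.nio.ByteBuffer.wrap(record, 0, 4).getInt();                           // [0..3]  integer
      String name = new String(record, 4, 17, java.nio.charset.StandardCharsets.UTF_8);   // [4..20] String
      String code = new String(record, 21, 7, java.nio.charset.StandardCharsets.UTF_8);   // [21..27] String
      // ...and so on for the remaining fields
      return new StringBuilder()
              .append(id).append(',')
              .append(name.trim()).append(',')
              .append(code.trim())
              .toString();
  }

Allocating a handful of short-lived objects per record is normally fine, but
multiplied across every record in a 128 MB block it adds up to a lot of
young-generation churn per task.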

2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
 Hi Guillermo,

 What exactly do you mean by each iteration?  Are you caching data in
 memory?

 -Sandy

 On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz konstt2...@gmail.com
 wrote:

I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
Any idea why I get so many GC pauses when I use more containers?

2015-02-05 8:59 GMT+01:00 Guillermo Ortiz konstt2...@gmail.com:
I'm not caching the data. By "each iteration" I mean each 128 MB block
that an executor has to process.

The code is pretty simple:

final Conversor c = new Conversor(null, null, null, longFields, typeFields);
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());

JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
    @Override
    public String call(byte[] arg0) throws Exception {
        String result = c.parse(arg0).toString();
        return result;
    }
});
rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");

The parse function just takes an array of bytes and applies some
transformations like: [0..3] an integer, [4..20] a String, [21..27]
another String, and so on.

It's just test code; I'd like to understand what is happening.

 2015-02-04 18:57 GMT+01:00 Sandy Ryza sandy.r...@cloudera.com:
 Hi Guillermo,

 What exactly do you mean by each iteration?  Are you caching data in
 memory?

 -Sandy

 On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz konstt2...@gmail.com
 wrote:

I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Problems with GC and time to execute with different number of executors.

2015-02-04 Thread Guillermo Ortiz
I execute a job in Spark where I'm processing an 80 GB file in HDFS.
I have 5 slaves:
(32 cores / 256 GB RAM / 7 physical disks) x 5

I have been trying many different configurations with YARN:
yarn.nodemanager.resource.memory-mb 196 GB
yarn.nodemanager.resource.cpu-vcores 24

I have tried to execute the job with different numbers of executors and
different executor memory (1-4 GB).
With 20 executors each iteration (128 MB) takes 25s and tasks never
spend a really long time waiting because of GC.

When I execute around 60 executors the processing time is about 45s and
some tasks take up to one minute because of GC.

I have no idea why GC kicks in when I run more executors simultaneously.
The other question is why it takes more time to process each block. My
theory is that there are only 7 physical disks, and 5 processes writing
at once is not the same as 20.

The code is pretty simple: it's just a map function which parses a line
and writes the output to HDFS. There are a lot of substring calls inside
the function, which could cause GC.

Any theories about this?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org