Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
Thank you all for the advice, including (1) using CMS GC, (2) using multiple 
worker instances, and (3) using Tachyon.

I will try (1) and (2) first and report back what I find.

I will also try JDK 7 with G1 GC.
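
For reference, a minimal sketch of how G1 is enabled on JDK 7 (standard HotSpot 
flags; the pause-time target is illustrative and would need tuning, and the same 
string could go into whichever Spark JVM-options setting ends up being used):

export SPARK_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"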

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan









Re: long GC pause during file.cache()

2014-06-16 Thread Wei Tan
BTW: nowadays a single machine with huge RAM (200G to 1T) is really 
common. With virtualization you lose some performance. It would be ideal 
to see some best practices on how to use Spark on these state-of-the-art 
machines...

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan









Re: long GC pause during file.cache()

2014-06-15 Thread Hao Wang
Hi, Wei

You may try setting JVM opts in spark-env.sh as follows to prevent or
mitigate GC pauses:

export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC
-Xmx2g -XX:MaxPermSize=256m"

There are more options you could add; please just Google :)
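
A few flags that are commonly added when diagnosing pauses like this, as an 
illustrative sketch only (standard HotSpot options; the values, and whether you 
put them in SPARK_JAVA_OPTS or the newer per-executor setting discussed 
elsewhere in this thread, are up to you):

export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"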


Regards,
Wang Hao(王灏)

CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com




Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
SPARK_JAVA_OPTS is deprecated in 1.0, though it works fine if you don’t mind 
the WARNING in the logs

you can set spark.executor.extraJavaOptions in your SparkConf obj  
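
A minimal sketch of that approach (property name as spelled in the Spark 1.0 
configuration docs; the flags shown are illustrative, not recommended values):

  import org.apache.spark.{SparkConf, SparkContext}

  // Pass extra JVM options to every executor; here, enable CMS plus GC logging.
  val conf = new SparkConf()
    .setAppName("gc-tuning-sketch")
    .set("spark.executor.extraJavaOptions",
         "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  val sc = new SparkContext(conf)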

Best,

--  
Nan Zhu





Re: long GC pause during file.cache()

2014-06-15 Thread Surendranauth Hiraman
Is SPARK_DAEMON_JAVA_OPTS valid in 1.0.0?









-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io


Re: long GC pause during file.cache()

2014-06-15 Thread Nan Zhu
Yes, I think it is listed in the comments in spark-env.sh.template (didn’t 
check…).  

Best,  

--  
Nan Zhu


  



Re: long GC pause during file.cache()

2014-06-15 Thread Aaron Davidson
Note also that Java does not work well with very large heaps, due to this
exact issue. There are two commonly used workarounds:

1) Spawn multiple (smaller) executors on the same machine. This can be done
by creating multiple Workers (via SPARK_WORKER_INSTANCES in standalone
mode[1]); a spark-env.sh sketch follows after the link below.
2) Use Tachyon for off-heap caching of RDDs, allowing Spark executors to be
smaller and to avoid GC pauses.

[1] See standalone documentation here:
http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
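
Regarding workaround (1), a minimal spark-env.sh sketch for a single 192G box, 
splitting the cache across several smaller worker JVMs instead of one 180g heap 
(the instance count and sizes here are illustrative and would need tuning):

SPARK_WORKER_INSTANCES=4
SPARK_WORKER_CORES=8          # cores handed to each worker
SPARK_WORKER_MEMORY=45g       # memory each worker may allocate to executors

Each executor then has to fit within a single worker's 45g (via 
spark.executor.memory) rather than the full 180g, which keeps individual heaps 
in a range the collector handles better.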







long GC pause during file.cache()

2014-06-14 Thread Wei Tan
Hi,

  I have a single-node (192G RAM) standalone Spark setup, with memory 
configuration like this in spark-env.sh:

SPARK_WORKER_MEMORY=180g
SPARK_MEM=180g


 In spark-shell I have a program like this:

val file = sc.textFile("/localpath") // file size is 40G
file.cache()


val output = file.map(line => extract something from line)

output.saveAsTextFile(...)


When I run this program again and again, or keep trying file.unpersist() 
--> file.cache() --> output.saveAsTextFile(), the run time varies a lot, 
from 1 min to 3 min to 50+ min. Whenever the run time is more than 1 min, 
from the stage monitoring GUI I observe long GC pauses (some can be 10+ 
min). Of course, when the run time is normal, say ~1 min, no significant GC 
is observed. The behavior seems somewhat random.

Is there any JVM tuning I should do to prevent this long GC pause from 
happening? 



I used java-1.6.0-openjdk.x86_64, and my spark-shell process is something 
like this:

root 10994  1.7  0.6 196378000 1361496 pts/51 Sl+ 22:06   0:12 
/usr/lib/jvm/java-1.6.0-openjdk.x86_64/bin/java -cp 
::/home/wtan/scala/spark-1.0.0-bin-hadoop1/conf:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-core-3.2.2.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-rdbms-3.2.1.jar:/home/wtan/scala/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar
 
-XX:MaxPermSize=128m -Djava.library.path= -Xms180g -Xmx180g 
org.apache.spark.deploy.SparkSubmit spark-shell --class 
org.apache.spark.repl.Main

Best regards,
Wei

-
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan