Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Marius Soutier
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned automatically. 
From the source code comments:
// SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the
// application finishes.


 On 13.04.2015, at 11:26, Guillaume Pitel guillaume.pi...@exensa.com wrote:
 
 Does it also clean up the Spark local dirs? I thought it was only cleaning 
 $SPARK_HOME/work/
 
 Guillaume



Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Marius Soutier
That’s true, spill dirs don’t get cleaned up when something goes wrong. We are 
restarting long-running jobs once in a while for cleanups and have 
spark.cleaner.ttl set to a lower value than the default.
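For reference, spark.cleaner.ttl can be set in spark-defaults.conf. A minimal sketch follows; the one-hour value is purely illustrative (the thread does not say what value was used, and the property is unset, i.e. effectively infinite, by default):

```
# spark-defaults.conf -- illustrative value only: periodically drop
# metadata and files for shuffles/RDDs older than one hour, which helps
# long-running applications that reuse one context
spark.cleaner.ttl   3600
```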

 On 14.04.2015, at 17:57, Guillaume Pitel guillaume.pi...@exensa.com wrote:
 
 Right, I remember now, the only problematic case is when things go bad and 
 the cleaner is not executed.
 
 Also, it can be a problem when reusing the same SparkContext for many runs.
 
 Guillaume



RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Wang, Ningjun (LNG-NPV)
> Also, it can be a problem when reusing the same SparkContext for many runs.

That is what happened to me. We use spark-jobserver with one SparkContext for 
all jobs. SPARK_LOCAL_DIRS is not cleaned up and is eating disk space quickly.

Ningjun


From: Marius Soutier [mailto:mps@gmail.com]
Sent: Tuesday, April 14, 2015 12:27 PM
To: Guillaume Pitel
Cc: user@spark.apache.org
Subject: Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

That's true, spill dirs don't get cleaned up when something goes wrong. We are 
restarting long-running jobs once in a while for cleanups and have 
spark.cleaner.ttl set to a lower value than the default.





Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-14 Thread Guillaume Pitel
Right, I remember now, the only problematic case is when things go bad 
and the cleaner is not executed.


Also, it can be a problem when reusing the same SparkContext for many runs.

Guillaume
It cleans the work dir, and SPARK_LOCAL_DIRS should be cleaned 
automatically. From the source code comments:

// SPARK_LOCAL_DIRS environment variable, and deleted by the Worker when the
// application finishes.



--


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. http://www.exensa.com/
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-13 Thread Marius Soutier
I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true 
-Dspark.worker.cleanup.appDataTtl=<seconds>"
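As a concrete sketch of such a spark-env.sh entry: the 1800s check interval and one-day TTL below are example values, not taken from the thread (spark.worker.cleanup.interval controls how often the standalone Worker runs its cleanup), and the value is quoted so the shell keeps both -D flags in one variable:

```shell
# spark-env.sh -- illustrative worker cleanup settings; interval (1800s)
# and appDataTtl (86400s = one day) are assumed example values
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=86400"
```

These settings apply to the standalone Worker's periodic cleanup of finished applications' data under its work directory.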

 On 11.04.2015, at 00:01, Wang, Ningjun (LNG-NPV) 
 ningjun.w...@lexisnexis.com wrote:
 
 Does anybody have an answer for this?
  
 Thanks
 Ningjun
  



Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-13 Thread Guillaume Pitel
Does it also clean up the Spark local dirs? I thought it was only cleaning 
$SPARK_HOME/work/


Guillaume

I have set SPARK_WORKER_OPTS in spark-env.sh for that. For example:

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true 
-Dspark.worker.cleanup.appDataTtl=<seconds>"







--


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. http://www.exensa.com/
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-11 Thread Guillaume Pitel

Hi,

I had to set up cron jobs for cleanup in $SPARK_HOME/work and in 
$SPARK_LOCAL_DIRS.


Here are the cron lines. Unfortunately they are written for *nix machines; 
I guess you will have to adapt them substantially for Windows.


12 * * * *  find "$SPARK_HOME/work" -cmin +1440 -prune -exec rm -rf {} \+
32 * * * *  find /tmp -type d -cmin +1440 -name 'spark-*-*-*' -prune 
-exec rm -rf {} \+
52 * * * *  find "$SPARK_LOCAL_DIR" -mindepth 1 -maxdepth 1 -type d -cmin 
+1440 -name 'spark-*-*-*' -prune -exec rm -rf {} \+


They remove directories older than a day.

The cron jobs have to be set up both on the executors AND on the driver (the 
Spark local dir of the driver can be heavily used if you use a lot of 
broadcast variables).


I think that in recent versions of Spark, $SPARK_HOME/work is correctly 
cleaned up, but adding a cron job won't hurt.
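The cron lines can also be wrapped in a small helper; the function below is a hypothetical sketch (not part of Spark, and not from the thread). It deletes spark-* scratch directories directly under a given root whose modification time is older than a given number of minutes. Note it uses -mmin (modification time) rather than the -cmin (change time) used in the cron lines above:

```shell
# Hypothetical helper: remove spark-* directories directly under $1 that
# were last modified more than $2 minutes ago (default 1440 = one day).
cleanup_spark_local() {
  root="$1"
  age_min="${2:-1440}"
  # -mindepth/-maxdepth 1: only look at immediate children of the root;
  # -prune: do not descend into matched dirs before removing them
  find "$root" -mindepth 1 -maxdepth 1 -type d -name 'spark-*' \
       -mmin "+$age_min" -prune -exec rm -rf {} +
}
```

Called hourly from cron (or ported to a scheduled task on Windows), this keeps a local dir like /tmp or C:\temp\spark-temp from filling up between job restarts.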


Guillaume






--


*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. http://www.exensa.com/
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705



RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-10 Thread Wang, Ningjun (LNG-NPV)
Does anybody have an answer for this?

Thanks
Ningjun

From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, April 02, 2015 12:14 PM
To: user@spark.apache.org
Subject: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, Spark 
writes to this folder. I found that the disk space used by this folder keeps 
increasing quickly, and at a certain point I will run out of disk space.

I wonder: does Spark clean up the disk space in this folder once the shuffle 
operation is done? If not, I need to write a job to clean it up myself. But 
how do I know which subfolders can be removed?

Ningjun