Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Actually, that did work, thanks.
What I previously tried, which did not work, was:
#BSUB -env "all,SPARK_LOCAL_DIRS=/tmp,/share/,SPARK_PID_DIR=..."
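A likely explanation, though this is an assumption about LSF's parser: `bsub -env` takes a comma-separated list of variables, so a value that itself contains a comma gets split apart at that level. Setting the variable inside the job script avoids that parser entirely:

```shell
# Assumption: LSF's `bsub -env` splits its argument on commas, so a
# comma-valued variable cannot be passed that way. Exporting inside the
# job script sidesteps the problem.
export SPARK_LOCAL_DIRS="/tmp,/share/"
echo "$SPARK_LOCAL_DIRS"   # prints /tmp,/share/
```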

However, I am still getting "No space left on device" errors. It seems that
I need a priority ordering of directories; round-robin distribution alone is
not enough. Any suggestions for getting Spark to write to dir2 when dir1
fails? Or can round robin be arranged so that the first task attempt writes
to dir1 and, if that attempt fails, the second attempt writes to dir2?

On Fri, Jan 12, 2024 at 10:23 PM Koert Kuipers  wrote:

> try it without spaces?
> export SPARK_LOCAL_DIRS="/tmp,/share/"
>
> On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen 
> wrote:
>
>> Hello Spark community
>>
>> SPARK_LOCAL_DIRS or
>> spark.local.dir
>> is supposed to accept a list.
>>
>> I want to list one local (fast) drive, followed by a GPFS network drive,
>> similar to what is done here:
>>
>> https://cug.org/proceedings/cug2016_proceedings/includes/files/pap129s2-file1.pdf
>> "Thus it is preferable to bias the data towards faster storage by
>> including multiple directories on the faster devices (e.g., SPARK LOCAL
>> DIRS=/tmp/spark1, /tmp/spark2, /tmp/spark3, /lus/scratch/sparkscratch/)."
>> The purpose of this is to combine the speed of local storage with
>> protection against "out of space" errors.
>>
>> However, for me, Spark is only considering the first directory in the list:
>> export SPARK_LOCAL_DIRS="/tmp, /share/"
>>
>> I am using Spark 3.4.1. Does anyone have any experience getting this to
>> work? If so, can you suggest a simple example I can try, and tell me which
>> version of Spark you are using?
>>
>> Regards
>> Andrew
>>
>> --
>> Andrew Petersen, PhD
>> Advanced Computing, Office of Information Technology
>> 2620 Hillsborough Street
>> datascience.oit.ncsu.edu
>>
>
> CONFIDENTIALITY NOTICE: This electronic communication and any files
> transmitted with it are confidential, privileged and intended solely for
> the use of the individual or entity to whom they are addressed. If you are
> not the intended recipient, you are hereby notified that any disclosure,
> copying, distribution (electronic or otherwise) or forwarding of, or the
> taking of any action in reliance on the contents of this transmission is
> strictly prohibited. Please notify the sender immediately by e-mail if you
> have received this email by mistake and delete this email from your system.
>
> Is it necessary to print this email? If you care about the environment
> like we do, please refrain from printing emails. It helps to keep the
> environment forested and litter-free.



-- 
Andrew Petersen, PhD
Advanced Computing, Office of Information Technology
2620 Hillsborough Street
datascience.oit.ncsu.edu


Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Without spaces was the first thing I tried. The information in the pdf file
inspired me to try the space.

-- 
Andrew Petersen, PhD
Advanced Computing, Office of Information Technology
2620 Hillsborough Street
datascience.oit.ncsu.edu


Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Koert Kuipers
try it without spaces?
export SPARK_LOCAL_DIRS="/tmp,/share/"
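The likely reason the spaced form fails (an assumption, but it matches the observed behavior): if the value is split on bare commas without trimming, the second entry keeps its leading space and names a different, relative path:

```shell
# Demonstrate what a naive comma split does to the spaced value:
raw="/tmp, /share/"
second=$(printf '%s' "$raw" | cut -d',' -f2)
printf '[%s]\n' "$second"   # prints [ /share/] -- note the leading space
# " /share/" is a relative path starting with a space, not /share/
```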



[spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Hello Spark community

SPARK_LOCAL_DIRS or
spark.local.dir
is supposed to accept a list.

I want to list one local (fast) drive, followed by a GPFS network drive,
similar to what is done here:
https://cug.org/proceedings/cug2016_proceedings/includes/files/pap129s2-file1.pdf
"Thus it is preferable to bias the data towards faster storage by including
multiple directories on the faster devices (e.g., SPARK LOCAL
DIRS=/tmp/spark1, /tmp/spark2, /tmp/spark3, /lus/scratch/sparkscratch/)."
The purpose of this is to combine the speed of local storage with protection
against "out of space" errors.

However, for me, Spark is only considering the first directory in the list:
export SPARK_LOCAL_DIRS="/tmp, /share/"

I am using Spark 3.4.1. Does anyone have any experience getting this to
work? If so, can you suggest a simple example I can try, and tell me which
version of Spark you are using?

Regards
Andrew

-- 
Andrew Petersen, PhD
Advanced Computing, Office of Information Technology
2620 Hillsborough Street
datascience.oit.ncsu.edu