RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Ranju Jain
Hi Mich,

I will check GCP buckets; I don't have much idea about how they work. It will be
easier for me to study GCP buckets if you validate my understanding below:

Are you looking for performance or durability?
[Ranju]: Durability, or I would say feasibility.

In general, every executor on every node should have access to GCP buckets
created under the project (assuming you are using a service account to run the
Spark job):
[Ranju]: Please check my understanding of the statement written above.
1. Does this bucket persist after the executors complete the job [i.e. store
the processed records into the bucket] and terminate? Or does it work as
ephemeral storage, which is only there while the executor is live?
2. Is the GCP bucket shareable across the driver pod and the executor pods?

Regards
Ranju

From: Mich Talebzadeh
Sent: Monday, March 8, 2021 8:32 PM
To: Ranju Jain
Cc: Ranju Jain; Attila Zsolt Piros; user@spark.apache.org
Subject: Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part
Files Storage

Hi Ranju,

In your statement:

"What is the best shared storage can be used to collate all executors part 
files at one place."

Are you looking for performance or durability?

In general, every executor on every node should have access to GCP buckets
created under the project (assuming you are using a service account to run the
Spark job):


gs://tmp_storage_bucket/



So you can try it and see if it works (create it first). Of course Spark needs 
to be aware of it.
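A minimal sketch of the kind of settings that make a Spark job aware of a GCS bucket, assuming the Hadoop GCS connector is on the classpath; the keyfile path below is a placeholder, not a value from this thread:

```python
# Hedged sketch: Hadoop/GCS connector settings a Spark job typically needs
# before it can read or write gs:// paths.
gcs_conf = {
    # Register the GCS connector as the filesystem for gs:// URIs
    "spark.hadoop.fs.gs.impl":
        "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    # Authenticate with a service account instead of user credentials
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
        "/path/to/service-account.json",  # placeholder path
}

def apply_conf(spark_builder, conf):
    """Apply each key/value to a SparkSession builder (or any object
    exposing .config(key, value))."""
    for key, value in conf.items():
        spark_builder = spark_builder.config(key, value)
    return spark_builder

# With a real pyspark install this would look like:
# spark = apply_conf(SparkSession.builder.appName("gcs-demo"), gcs_conf).getOrCreate()
# spark.range(10).write.parquet("gs://tmp_storage_bucket/tmp/demo")
```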



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Mon, 8 Mar 2021 at 14:46, Ranju Jain <ranju.j...@ericsson.com> wrote:
Hi Mich,

The purpose is that all Spark executors running on K8s worker nodes write
their processed task data [part files] to some shared storage, and the driver
pod running on the same Kubernetes cluster then accesses that shared storage
and converts all those part files to a single file.

So I am looking for the shared storage options available to persist the part
files. What is the best shared storage that can be used to collate all
executors' part files in one place?

Regards
Ranju
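The driver-side step of collating part files into a single file can be sketched locally; this assumes text-format part files and illustrative names (a real job might instead use coalesce(1) before writing, or Hadoop's copyMerge):

```python
import tempfile
from pathlib import Path

def merge_part_files(input_dir: str, output_file: str) -> int:
    """Concatenate Spark-style part-* files (sorted by name, so
    part-00000 comes before part-00001) into one output file.
    Returns the number of part files merged. Assumes text output."""
    parts = sorted(Path(input_dir).glob("part-*"))
    with open(output_file, "w") as out:
        for part in parts:
            out.write(part.read_text())
    return len(parts)

# Demo with two fake part files in a temp directory
tmp = tempfile.mkdtemp()
Path(tmp, "part-00000").write_text("row1\n")
Path(tmp, "part-00001").write_text("row2\n")
merged = str(Path(tmp, "merged.txt"))
count = merge_part_files(tmp, merged)
```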

From: Mich Talebzadeh <mich.talebza...@gmail.com>
Sent: Monday, March 8, 2021 8:06 PM
To: Ranju Jain <ranju.j...@ericsson.com.invalid>
Cc: Attila Zsolt Piros <piros.attila.zs...@gmail.com>; user@spark.apache.org
Subject: Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part
Files Storage

If the purpose is temporary work and writes, put it in a temporary
sub-directory under a given bucket:

spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])

That dict reference points to this yml file entry:

GCPVariables:
   tmp_bucket: "tmp_storage_bucket/tmp"


Just create a temporary bucket with a sub-directory tmp underneath:

tmp_storage_bucket/tmp
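A sketch of how the yml entry above feeds the config['GCPVariables']['tmp_bucket'] lookup; the tiny parser is a stand-in for a real yaml loader (e.g. yaml.safe_load) so the example stays stdlib-only:

```python
# Minimal sketch of how the yml entry feeds config['GCPVariables']['tmp_bucket'].
yml_text = """GCPVariables:
   tmp_bucket: "tmp_storage_bucket/tmp"
"""

def parse_simple_yml(text):
    """Parse a flat two-level 'section: / key: value' yml fragment."""
    config, section = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):            # top-level section name
            section = line.rstrip(":").strip()
            config[section] = {}
        else:                                   # indented key: value pair
            key, _, value = line.strip().partition(":")
            config[section][key.strip()] = value.strip().strip('"')
    return config

config = parse_simple_yml(yml_text)
# spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])
```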



HTH





On Sun, 7 Mar 2021 at 16:23, Ranju Jain <ranju.j...@ericsson.com.invalid> wrote:
Hi,

I need to save the executors' processed data in the form of part files, but I
think a persistent volume is not an option for this, as executors terminate
after their work completes. So I am thinking of using a shared volume across
executor pods.

Should I go with NFS, or is there any other volume option to explore?

Regards
Ranju
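One hedged option for the shared-volume question: Spark on Kubernetes can mount a ReadWriteMany PersistentVolumeClaim (for example one backed by NFS) into the driver and every executor via its spark.kubernetes.<role>.volumes.* settings; the claim name and mount path below are placeholders:

```python
# Hedged sketch: Spark-on-Kubernetes volume settings that mount a shared
# ReadWriteMany PVC into both the driver and every executor pod.
volume = "shared-parts"   # placeholder PVC/volume name
mount_path = "/shared"    # placeholder mount path inside the pods

def k8s_volume_conf(role: str) -> dict:
    """Build the spark.kubernetes.<role>.volumes.* keys for one pod role
    ('driver' or 'executor')."""
    prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{volume}"
    return {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
        f"{prefix}.options.claimName": volume,
    }

# Merge driver and executor settings into one conf dict for spark-submit
conf = {**k8s_volume_conf("driver"), **k8s_volume_conf("executor")}
```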


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi,

On GCP I'd go for buckets in Google Cloud Storage. I'm not sure how reliable
it is in production deployments, though; I only have demo experience here.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski




RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Ranju Jain
Hi Jacek,

I am using the property spark.kubernetes.executor.deleteOnTermination=true
only to troubleshoot; otherwise I free up resources after executors complete
their job.
Now I want to use some shared storage which can be shared by all executors to
write the part files.
Which Kubernetes storage should I go for?

Regards
Ranju
From: Jacek Laskowski 
Sent: Monday, March 8, 2021 4:14 PM
To: Ranju Jain 
Cc: Attila Zsolt Piros ; user@spark.apache.org
Subject: Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part 
Files Storage

Hi,

> as Executors terminates after their work completes.

--conf spark.kubernetes.executor.deleteOnTermination=false ?
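For context, deleteOnTermination=false only keeps finished executor pods around for inspection; it does not preserve their ephemeral storage as shared output. A sketch of passing the flag on spark-submit, with placeholder master URL and application path:

```python
# Sketch: composing spark-submit arguments with the suggested flag.
# Master URL and application path are placeholders, not values from this thread.
def spark_submit_args(keep_executor_pods: bool) -> list:
    """Return a spark-submit command line; keep_executor_pods=True keeps
    finished executor pods around (deleteOnTermination=false)."""
    flag = "false" if keep_executor_pods else "true"
    return [
        "spark-submit",
        "--master", "k8s://https://kubernetes.default.svc",  # placeholder
        "--conf", f"spark.kubernetes.executor.deleteOnTermination={flag}",
        "local:///opt/spark/app.py",  # placeholder application
    ]

args = spark_submit_args(keep_executor_pods=True)
```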

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online 
Books<https://protect2.fireeye.com/v1/url?k=901b36bb-cf800fbe-901b7620-86d2114eab2f-836d4bb779fe8f92=1=d3b471ef-b3ce-4cfd-9bcd-f42590d0f10b=https%3A%2F%2Fbooks.japila.pl%2F>
Follow me on https://twitter.com/jaceklaskowski




