Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi,
In k8s the driver is responsible for executor creation. The likely cause of
your problem is insufficient memory allocated for executors in the K8s
cluster. Even with dynamic allocation, k8s won't schedule executor pods if
there is not enough free memory to fulfill their resource requests.

My suggestions

   - Increase Executor Memory: Allocate more memory per executor (e.g., 2GB
   or 3GB) to allow for multiple executors within available cluster memory.
   - Adjust Driver Pod Resources: Ensure the driver pod has enough memory
   to run Spark and manage executors.
   - Optimize Resource Management: Explore on-demand allocation or adjust
   allocation granularity for better resource utilization. For example, look
   at the documentation for executor on-demand allocation
   (spark.executor.cores=0) and for spark.dynamicAllocation.minExecutors and
   spark.dynamicAllocation.maxExecutors (see the sketch after this list).
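
As a rough illustration only (the image name, namespace, jar path and the
memory/executor numbers below are placeholders, and the right values depend
on what your cluster nodes can actually offer), the relevant spark-submit
settings would look something like this:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name my-app \
   --conf spark.kubernetes.namespace=my-namespace \
   --conf spark.kubernetes.container.image=my-spark-image:latest \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.dynamicAllocation.minExecutors=2 \
   --conf spark.dynamicAllocation.maxExecutors=6 \
   --conf spark.driver.memory=2g \
   --conf spark.driver.cores=2 \
   --conf spark.executor.memory=2g \
   --conf spark.executor.cores=2 \
   local:///opt/spark/jars/my-app.jar

If the executor pods still never appear, a kubectl describe on the pending
pods (and the driver log) will normally tell you whether the scheduler
rejected them for lack of memory or CPU.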

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 30 Apr 2024 at 04:29, Tarun raghav 
wrote:

> Respected Sir/Madam,
> I am Tarunraghav. I have a query regarding spark on kubernetes.
>
> We have an eks cluster, within which we have spark installed in the pods.
> We set the executor memory as 1GB and set the executor instances as 2, I
> have also set dynamic allocation as true. So when I try to read a 3 GB CSV
> file or parquet file, it is supposed to increase the number of pods by 2.
> But the number of executor pods is zero.
> I don't know why executor pods aren't being created, even though I set
> executor instance as 2. Please suggest a solution for this.
>
> Thanks & Regards,
> Tarunraghav
>
>


Spark on Kubernetes

2024-04-29 Thread Tarun raghav
Respected Sir/Madam,
I am Tarunraghav. I have a query regarding spark on kubernetes.

We have an eks cluster, within which we have spark installed in the pods.
We set the executor memory as 1GB and set the executor instances as 2, I
have also set dynamic allocation as true. So when I try to read a 3 GB CSV
file or parquet file, it is supposed to increase the number of pods by 2.
But the number of executor pods is zero.
I don't know why executor pods aren't being created, even though I set
executor instance as 2. Please suggest a solution for this.

Thanks & Regards,
Tarunraghav


Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words Sri

Well, it is true that as yet Spark on Kubernetes is not on a par with Spark
on YARN in maturity, and essentially Spark on Kubernetes is still a work in
progress. *So in the first place, IMO, one needs to think about why the
executors are failing. What causes this behaviour? Is it the code or some
inadequate set-up?* These things come to my mind:


   - Resource Allocation: Insufficient resources (CPU, memory) can lead to
   executor failures.
   - Mis-configuration Issues: Verify that the configurations are
   appropriate for your workload.
   - External Dependencies: If your Spark job relies on external services
   or data sources, ensure they are accessible. Issues such as network
   problems or unavailability of external services can lead to executor
   failures.
   - Data Skew: Uneven distribution of data across partitions can lead to
   data skew and cause some executors to process significantly more data than
   others. This can lead to resource exhaustion on specific executors.
   - Spark Version and Kubernetes Compatibility: Whether Spark is running on
   EKS or GKE, make sure you are using a Spark version that is compatible
   with your Kubernetes environment. These vendors normally run older, more
   stable versions of Spark, and compatibility issues can arise when using
   your own, newer version of Spark.
   - Docker images: How up to date are your docker images on the container
   registries (ECR, GCR)? Is there any incompatibility between docker images
   built on one Spark version and the host Spark version you are submitting
   your spark-submit from? (A quick way to check what the failing pods
   themselves report is sketched after this list.)
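
As a quick, low-effort check (the namespace and pod name below are
placeholders), Kubernetes itself usually tells you why an executor pod
failed or never started:

# List the pods of the Spark application
kubectl get pods -n spark-namespace
# The Events section at the bottom shows scheduling, image-pull or OOM issues
kubectl describe pod <failed-executor-pod> -n spark-namespace
# Executor stdout/stderr for application-level errors
kubectl logs <failed-executor-pod> -n spark-namespace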

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 23:18, Sri Potluri  wrote:

> Dear Mich,
>
> Thank you for your detailed response and the suggested approach to
> handling retry logic. I appreciate you taking the time to outline the
> method of embedding custom retry mechanisms directly into the application
> code.
>
> While the solution of wrapping the main logic of the Spark job in a loop
> for controlling the number of retries is technically sound and offers a
> workaround, it may not be the most efficient or maintainable solution for
> organizations running a large number of Spark jobs. Modifying each
> application to include custom retry logic can be a significant undertaking,
> introducing variability in how retries are handled across different jobs,
> and require additional testing and maintenance.
>
> Ideally, operational concerns like retry behavior in response to
> infrastructure failures should be decoupled from the business logic of
> Spark applications. This separation allows data engineers and scientists to
> focus on the application logic without needing to implement and test
> infrastructure resilience mechanisms.
>
> Thank you again for your time and assistance.
>
> Best regards,
> Sri Potluri
>
> On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh 
> wrote:
>
>> Went through your issue with the code running on k8s
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> Well fault tolerance is built especially in k8s cluster. You can
>> implement your own logic to control the retry attempts. You can do this
>> by wrapping the main logic of your Spark job in a loop and controlling the
>> number of retries. If a persistent issue is detected, you can choose to
>> stop the job. Today is the third time that looping control has come up :)
>>
>> Take this code
>>
>> import time
>> max_retries = 5 retries = 0 while retries < max_retries: try: # Your
>> Spark job logic here except Exception as e: # Log the exception
>> print(f"Exception in Spark job:

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported the window-based executor failure-tracking mechanism for
YARN for a long time; SPARK-41210 [1][2] (included in 3.5.0) extended this
feature to K8s.

[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732
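
A rough sketch of how that could be enabled on 3.5.0+ is below; the config
names are from memory and the image, app name and threshold values are
placeholders, so please verify them against the 3.5 configuration docs
before relying on them:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name my-app \
   --conf spark.kubernetes.container.image=my-spark-image:latest \
   --conf spark.executor.maxNumFailures=5 \
   --conf spark.executor.failuresValidityInterval=10m \
   local:///opt/spark/jars/my-app.jar

With settings along those lines the application should give up once more
than the allowed number of executors fail within the validity window,
instead of recreating executors indefinitely.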

Thanks,
Cheng Pan


> On Feb 19, 2024, at 23:59, Sri Potluri  wrote:
> 
> Hello Spark Community,
> 
> I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, 
> for running various Spark applications. While the system generally works 
> well, I've encountered a challenge related to how Spark applications handle 
> executor failures, specifically in scenarios where executors enter an error 
> state due to persistent issues.
> 
> Problem Description
> 
> When an executor of a Spark application fails, the system attempts to 
> maintain the desired level of parallelism by automatically recreating a new 
> executor to replace the failed one. While this behavior is beneficial for 
> transient errors, ensuring that the application continues to run, it becomes 
> problematic in cases where the failure is due to a persistent issue (such as 
> misconfiguration, inaccessible external resources, or incompatible 
> environment settings). In such scenarios, the application enters a loop, 
> continuously trying to recreate executors, which leads to resource wastage 
> and complicates application management.
> 
> Desired Behavior
> 
> Ideally, I would like to have a mechanism to limit the number of retries for 
> executor recreation. If the system fails to successfully create an executor 
> more than a specified number of times (e.g., 5 attempts), the entire Spark 
> application should fail and stop trying to recreate the executor. This 
> behavior would help in efficiently managing resources and avoiding prolonged 
> failure states.
> 
> Questions for the Community
> 
> 1. Is there an existing configuration or method within Spark or the Spark 
> Operator to limit executor recreation attempts and fail the job after 
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or 
> solutions that could be applied in this context?
> 
> 
> Additional Context
> 
> I have explored Spark's task and stage retry configurations 
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these 
> do not directly address the issue of limiting executor creation retries. 
> Implementing a custom monitoring solution to track executor failures and 
> manually stop the application is a potential workaround, but it would be 
> preferable to have a more integrated solution.
> 
> I appreciate any guidance, insights, or feedback you can provide on this 
> matter.
> 
> Thank you for your time and support.
> 
> Best regards,
> Sri P





Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich,

Thank you for your detailed response and the suggested approach to handling
retry logic. I appreciate you taking the time to outline the method of
embedding custom retry mechanisms directly into the application code.

While the solution of wrapping the main logic of the Spark job in a loop
for controlling the number of retries is technically sound and offers a
workaround, it may not be the most efficient or maintainable solution for
organizations running a large number of Spark jobs. Modifying each
application to include custom retry logic can be a significant undertaking,
introducing variability in how retries are handled across different jobs,
and requiring additional testing and maintenance.

Ideally, operational concerns like retry behavior in response to
infrastructure failures should be decoupled from the business logic of
Spark applications. This separation allows data engineers and scientists to
focus on the application logic without needing to implement and test
infrastructure resilience mechanisms.

Thank you again for your time and assistance.

Best regards,
Sri Potluri

On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh 
wrote:

> Went through your issue with the code running on k8s
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it
> becomes problematic in cases where the failure is due to a persistent issue
> (such as misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> Well, fault tolerance is built in, especially in a k8s cluster. You can
> implement your
> own logic to control the retry attempts. You can do this by wrapping the
> main logic of your Spark job in a loop and controlling the number of
> retries. If a persistent issue is detected, you can choose to stop the job.
> Today is the third time that looping control has come up :)
>
> Take this code
>
> import time
> max_retries = 5
> retries = 0
> while retries < max_retries:
>     try:
>         pass  # Your Spark job logic here
>     except Exception as e:
>         print(f"Exception in Spark job: {str(e)}")  # Log the exception
>         retries += 1  # Increment the retry count
>         time.sleep(60)  # Sleep before retrying
>     else:
>         break  # Break out of the loop if the job completes successfully
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh 
> wrote:
>
>> Not that I am aware of any configuration parameter in Spark classic to
>> limit executor creation. Because of fault tolerance Spark will try to
>> recreate failed executors. Not really that familiar with the Spark operator
>> for k8s. There may be something there.
>>
>> Have you considered custom monitoring and handling within Spark itself
>> using max_retries = 5  etc?
>>
>> HTH
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:
>>
>>> Hello Spark Community,
>>>
>>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>>> Operator, for running various

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s

When an executor of a Spark application fails, the system attempts to
maintain the desired level of parallelism by automatically recreating a new
executor to replace the failed one. While this behavior is beneficial for
transient errors, ensuring that the application continues to run, it
becomes problematic in cases where the failure is due to a persistent issue
(such as misconfiguration, inaccessible external resources, or incompatible
environment settings). In such scenarios, the application enters a loop,
continuously trying to recreate executors, which leads to resource wastage
and complicates application management.

Well, fault tolerance is built in, especially in a k8s cluster. You can implement your
own logic to control the retry attempts. You can do this by wrapping the
main logic of your Spark job in a loop and controlling the number of
retries. If a persistent issue is detected, you can choose to stop the job.
Today is the third time that looping control has come up :)

Take this code

import time
max_retries = 5
retries = 0
while retries < max_retries:
    try:
        pass  # Your Spark job logic here
    except Exception as e:
        print(f"Exception in Spark job: {str(e)}")  # Log the exception
        retries += 1  # Increment the retry count
        time.sleep(60)  # Sleep before retrying
    else:
        break  # Break out of the loop if the job completes successfully

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh 
wrote:

> Not that I am aware of any configuration parameter in Spark classic to
> limit executor creation. Because of fault tolerance Spark will try to
> recreate failed executors. Not really that familiar with the Spark operator
> for k8s. There may be something there.
>
> Have you considered custom monitoring and handling within Spark itself
> using max_retries = 5  etc?
>
> HTH
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:
>
>> Hello Spark Community,
>>
>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>> Operator, for running various Spark applications. While the system
>> generally works well, I've encountered a challenge related to how Spark
>> applications handle executor failures, specifically in scenarios where
>> executors enter an error state due to persistent issues.
>>
>> *Problem Description*
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> *Desired Behavior*
>>
>> Ideally, I would like to have a mechanism to limit the number of retries
>> for executor recreation. If the system fails to successfully create an
>> executor more than a specified number of times (e.g., 5 attempts), the
>> entire Spark application should fail and stop trying to recreate the
>> executor. This behavior would help in efficiently managing resources and
>> avoiding prolonged failure states.
>>
>> *Questions for the Community*
>>
>> 1. 

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
I am not aware of any configuration parameter in classic Spark to limit
executor creation. Because of fault tolerance, Spark will try to recreate
failed executors. I am not really that familiar with the Spark operator for
k8s; there may be something there.

Have you considered custom monitoring and handling within Spark itself
using max_retries = 5  etc?

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:

> Hello Spark Community,
>
> I am currently leveraging Spark on Kubernetes, managed by the Spark
> Operator, for running various Spark applications. While the system
> generally works well, I've encountered a challenge related to how Spark
> applications handle executor failures, specifically in scenarios where
> executors enter an error state due to persistent issues.
>
> *Problem Description*
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it
> becomes problematic in cases where the failure is due to a persistent issue
> (such as misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> *Desired Behavior*
>
> Ideally, I would like to have a mechanism to limit the number of retries
> for executor recreation. If the system fails to successfully create an
> executor more than a specified number of times (e.g., 5 attempts), the
> entire Spark application should fail and stop trying to recreate the
> executor. This behavior would help in efficiently managing resources and
> avoiding prolonged failure states.
>
> *Questions for the Community*
>
> 1. Is there an existing configuration or method within Spark or the Spark
> Operator to limit executor recreation attempts and fail the job after
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or
> solutions that could be applied in this context?
>
>
> *Additional Context*
>
> I have explored Spark's task and stage retry configurations
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
> do not directly address the issue of limiting executor creation retries.
> Implementing a custom monitoring solution to track executor failures and
> manually stop the application is a potential workaround, but it would be
> preferable to have a more integrated solution.
>
> I appreciate any guidance, insights, or feedback you can provide on this
> matter.
>
> Thank you for your time and support.
>
> Best regards,
> Sri P
>


[Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Hello Spark Community,

I am currently leveraging Spark on Kubernetes, managed by the Spark
Operator, for running various Spark applications. While the system
generally works well, I've encountered a challenge related to how Spark
applications handle executor failures, specifically in scenarios where
executors enter an error state due to persistent issues.

*Problem Description*

When an executor of a Spark application fails, the system attempts to
maintain the desired level of parallelism by automatically recreating a new
executor to replace the failed one. While this behavior is beneficial for
transient errors, ensuring that the application continues to run, it
becomes problematic in cases where the failure is due to a persistent issue
(such as misconfiguration, inaccessible external resources, or incompatible
environment settings). In such scenarios, the application enters a loop,
continuously trying to recreate executors, which leads to resource wastage
and complicates application management.

*Desired Behavior*

Ideally, I would like to have a mechanism to limit the number of retries
for executor recreation. If the system fails to successfully create an
executor more than a specified number of times (e.g., 5 attempts), the
entire Spark application should fail and stop trying to recreate the
executor. This behavior would help in efficiently managing resources and
avoiding prolonged failure states.

*Questions for the Community*

1. Is there an existing configuration or method within Spark or the Spark
Operator to limit executor recreation attempts and fail the job after
reaching a threshold?

2. Has anyone else encountered similar challenges and found workarounds or
solutions that could be applied in this context?


*Additional Context*

I have explored Spark's task and stage retry configurations
(`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
do not directly address the issue of limiting executor creation retries.
Implementing a custom monitoring solution to track executor failures and
manually stop the application is a potential workaround, but it would be
preferable to have a more integrated solution.

I appreciate any guidance, insights, or feedback you can provide on this
matter.

Thank you for your time and support.

Best regards,
Sri P


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Yes, I have gone through it. So please give me the setup. More context: my
jar file is written in Java.

On Mon, Feb 19, 2024, 8:53 PM Mich Talebzadeh 
wrote:

> Sure, but first it would be beneficial to understand the way Spark works on
> Kubernetes and the concepts.
>
> Have a look at this article of mine
>
> Spark on Kubernetes, A Practitioner’s Guide
> <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-%3FtrackingId=Wsu3lkoPaCWqGemYHe8%252BLQ%253D%253D/?trackingId=Wsu3lkoPaCWqGemYHe8%2BLQ%3D%3D>
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 19 Feb 2024 at 15:09, Jagannath Majhi <
> jagannath.ma...@cloud.cbnits.com> wrote:
>
>> Yes
>>
>> On Mon, Feb 19, 2024, 8:35 PM Mich Talebzadeh 
>> wrote:
>>
>>> OK you have a jar file that you want to work with when running using
>>> Spark on k8s as the execution engine (EKS) as opposed to  YARN on EMR as
>>> the execution engine?
>>>
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Mon, 19 Feb 2024 at 14:38, Jagannath Majhi <
>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>
>>>> I am not using any private docker image. Only I am running the jar file
>>>> in EMR using spark-submit command so now I want to run this jar file in eks
>>>> so can you please tell me how can I set-up for this ??
>>>>
>>>> On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
>>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>>
>>>>> Can we connect over Google meet??
>>>>>
>>>>> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Where is your docker file? In ECR container registry.
>>>>>> If you are going to use EKS, then it need to be accessible to all
>>>>>> nodes of cluster
>>>>>>
>>>>>> When you build your docker image, put your jar under the $SPARK_HOME
>>>>>> directory. Then add a line to your docker build file as below
>>>>>> Here I am accessing Google BigQuery DW from EKS
>>>>>> # Add a BigQuery connector jar.
>>>>>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>>>>>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>>>>>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>>>>>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>>>>>> COPY --chown=spark:spark \
>>>>>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>>>>>> "${SPARK_EXTRA_JARS_DIR}"
>>>>>>
>>>>>> Here I am accessing Google BigQuery DW from EKS cluster
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
I am not using any private docker image. I am only running the jar file in
EMR using the spark-submit command. Now I want to run this jar file in EKS,
so can you please tell me how I can set this up?

On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> Can we connect over Google meet??
>
> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh 
> wrote:
>
>> Where is your docker file? In ECR container registry.
>> If you are going to use EKS, then it need to be accessible to all nodes
>> of cluster
>>
>> When you build your docker image, put your jar under the $SPARK_HOME
>> directory. Then add a line to your docker build file as below
>> Here I am accessing Google BigQuery DW from EKS
>> # Add a BigQuery connector jar.
>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>> COPY --chown=spark:spark \
>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>> "${SPARK_EXTRA_JARS_DIR}"
>>
>> Here I am accessing Google BigQuery DW from EKS cluster
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 19 Feb 2024 at 13:42, Jagannath Majhi <
>> jagannath.ma...@cloud.cbnits.com> wrote:
>>
>>> Dear Spark Community,
>>>
>>> I hope this email finds you well. I am reaching out to seek assistance
>>> and guidance regarding a task I'm currently working on involving Apache
>>> Spark.
>>>
>>> I have developed a JAR file that contains some Spark applications and
>>> functionality, and I need to run this JAR file within a Spark cluster.
>>> However, the JAR file is located in an AWS S3 bucket. I'm facing some
>>> challenges in configuring Spark to access and execute this JAR file
>>> directly from the S3 bucket.
>>>
>>> I would greatly appreciate any advice, best practices, or pointers on
>>> how to achieve this integration effectively. Specifically, I'm looking for
>>> insights on:
>>>
>>>1. Configuring Spark to access and retrieve the JAR file from an AWS
>>>S3 bucket.
>>>2. Setting up the necessary permissions and authentication
>>>mechanisms to ensure seamless access to the S3 bucket.
>>>3. Any potential performance considerations or optimizations when
>>>running Spark applications with dependencies stored in remote storage 
>>> like
>>>AWS S3.
>>>
>>> If anyone in the community has prior experience or knowledge in this
>>> area, I would be extremely grateful for your guidance. Additionally, if
>>> there are any relevant resources, documentation, or tutorials that you
>>> could recommend, it would be incredibly helpful.
>>>
>>> Thank you very much for considering my request. I look forward to
>>> hearing from you and benefiting from the collective expertise of the Spark
>>> community.
>>>
>>> Best regards, Jagannath Majhi
>>>
>>


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Sure, but first it would be beneficial to understand the way Spark works on
Kubernetes and the concepts.

Have a look at this article of mine

Spark on Kubernetes, A Practitioner’s Guide
<https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-%3FtrackingId=Wsu3lkoPaCWqGemYHe8%252BLQ%253D%253D/?trackingId=Wsu3lkoPaCWqGemYHe8%2BLQ%3D%3D>

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 15:09, Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> Yes
>
> On Mon, Feb 19, 2024, 8:35 PM Mich Talebzadeh 
> wrote:
>
>> OK you have a jar file that you want to work with when running using
>> Spark on k8s as the execution engine (EKS) as opposed to  YARN on EMR as
>> the execution engine?
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 19 Feb 2024 at 14:38, Jagannath Majhi <
>> jagannath.ma...@cloud.cbnits.com> wrote:
>>
>>> I am not using any private docker image. Only I am running the jar file
>>> in EMR using spark-submit command so now I want to run this jar file in eks
>>> so can you please tell me how can I set-up for this ??
>>>
>>> On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>
>>>> Can we connect over Google meet??
>>>>
>>>> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Where is your docker file? In ECR container registry.
>>>>> If you are going to use EKS, then it need to be accessible to all
>>>>> nodes of cluster
>>>>>
>>>>> When you build your docker image, put your jar under the $SPARK_HOME
>>>>> directory. Then add a line to your docker build file as below
>>>>> Here I am accessing Google BigQuery DW from EKS
>>>>> # Add a BigQuery connector jar.
>>>>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>>>>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>>>>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>>>>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>>>>> COPY --chown=spark:spark \
>>>>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>>>>> "${SPARK_EXTRA_JARS_DIR}"
>>>>>
>>>>> Here I am accessing Google BigQuery DW from EKS cluster
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>> expert opinions (Werner
>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
OK you have a jar file that you want to work with when running using Spark
on k8s as the execution engine (EKS) as opposed to  YARN on EMR as the
execution engine?


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 19 Feb 2024 at 14:38, Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> I am not using any private docker image. Only I am running the jar file in
> EMR using spark-submit command so now I want to run this jar file in eks so
> can you please tell me how can I set-up for this ??
>
> On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
> jagannath.ma...@cloud.cbnits.com> wrote:
>
>> Can we connect over Google meet??
>>
>> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh 
>> wrote:
>>
>>> Where is your docker file? In ECR container registry.
>>> If you are going to use EKS, then it need to be accessible to all nodes
>>> of cluster
>>>
>>> When you build your docker image, put your jar under the $SPARK_HOME
>>> directory. Then add a line to your docker build file as below
>>> Here I am accessing Google BigQuery DW from EKS
>>> # Add a BigQuery connector jar.
>>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>>> COPY --chown=spark:spark \
>>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>>> "${SPARK_EXTRA_JARS_DIR}"
>>>
>>> Here I am accessing Google BigQuery DW from EKS cluster
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Mon, 19 Feb 2024 at 13:42, Jagannath Majhi <
>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>
 Dear Spark Community,

 I hope this email finds you well. I am reaching out to seek assistance
 and guidance regarding a task I'm currently working on involving Apache
 Spark.

 I have developed a JAR file that contains some Spark applications and
 functionality, and I need to run this JAR file within a Spark cluster.
 However, the JAR file is located in an AWS S3 bucket. I'm facing some
 challenges in configuring Spark to access and execute this JAR file
 directly from the S3 bucket.

 I would greatly appreciate any advice, best practices, or pointers on
 how to achieve this integration effectively. Specifically, I'm looking for
 insights on:

1. Configuring Spark to access and retrieve the JAR file from an
AWS S3 bucket.
2. Setting up the necessary permissions and authentication
mechanisms to ensure seamless access to the S3 bucket.
3. Any potential performance considerations or optimizations when
running Spark applications with dependencies stored in remote storage 
 like
AWS S3.

 If anyone in the community has prior experience or knowledge in this
 area, I would be extremely grateful for your guidance. Additionally, if
 there are any relevant resources, documentation, or tutorials that you
 could recommend, it would be incredibly helpful.

 Thank you very much for considering my request. I look forward to
 hearing from you and benefiting from the collective expertise of the Spark
 community.

 Best regards, Jagannath Majhi

>>>


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Where is your docker file? Is it in the ECR container registry?
If you are going to use EKS, then it needs to be accessible to all nodes of
the cluster.

When you build your docker image, put your jar under the $SPARK_HOME
directory. Then add lines to your docker build file as below.
Here I am accessing Google BigQuery DW from EKS
# Add a BigQuery connector jar.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
&& chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
COPY --chown=spark:spark \
spark-bigquery-with-dependencies_2.12-0.22.2.jar
"${SPARK_EXTRA_JARS_DIR}"

Here I am accessing Google BigQuery DW from EKS cluster
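
Once the image is pushed to a registry that the EKS nodes can pull from (ECR
in your case), a sketch of the corresponding spark-submit is below. The
registry path, namespace, service account, main class and jar name are
placeholders; the jar is referenced with the local:// scheme because it is
already baked into the image:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name my-java-app \
   --class com.example.Main \
   --conf spark.kubernetes.namespace=spark \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
   --conf spark.kubernetes.container.image=<account-id>.dkr.ecr.<region>.amazonaws.com/my-spark-image:latest \
   --conf spark.executor.instances=2 \
   local:///opt/spark/jars/my-java-app.jar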

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 19 Feb 2024 at 13:42, Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> Dear Spark Community,
>
> I hope this email finds you well. I am reaching out to seek assistance and
> guidance regarding a task I'm currently working on involving Apache Spark.
>
> I have developed a JAR file that contains some Spark applications and
> functionality, and I need to run this JAR file within a Spark cluster.
> However, the JAR file is located in an AWS S3 bucket. I'm facing some
> challenges in configuring Spark to access and execute this JAR file
> directly from the S3 bucket.
>
> I would greatly appreciate any advice, best practices, or pointers on how
> to achieve this integration effectively. Specifically, I'm looking for
> insights on:
>
>1. Configuring Spark to access and retrieve the JAR file from an AWS
>S3 bucket.
>2. Setting up the necessary permissions and authentication mechanisms
>to ensure seamless access to the S3 bucket.
>3. Any potential performance considerations or optimizations when
>running Spark applications with dependencies stored in remote storage like
>AWS S3.
>
> If anyone in the community has prior experience or knowledge in this area,
> I would be extremely grateful for your guidance. Additionally, if there are
> any relevant resources, documentation, or tutorials that you could
> recommend, it would be incredibly helpful.
>
> Thank you very much for considering my request. I look forward to hearing
> from you and benefiting from the collective expertise of the Spark
> community.
>
> Best regards, Jagannath Majhi
>


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Richard Smith

I run my Spark jobs in GCP with Google Dataproc using GCS buckets.

I've not used AWS, but its EMR product offers similar functionality to 
Dataproc. The title of your post implies your Spark cluster runs on EKS. 
You might be better off using EMR, see links below:


EMR 
https://medium.com/big-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16


EKS https://medium.com/@vikas.navlani/running-spark-on-aws-eks-1cd4c31786c

Richard

On 19/02/2024 13:36, Jagannath Majhi wrote:


Dear Spark Community,

I hope this email finds you well. I am reaching out to seek assistance 
and guidance regarding a task I'm currently working on involving 
Apache Spark.


I have developed a JAR file that contains some Spark applications and 
functionality, and I need to run this JAR file within a Spark cluster. 
However, the JAR file is located in an AWS S3 bucket. I'm facing some 
challenges in configuring Spark to access and execute this JAR file 
directly from the S3 bucket.


I would greatly appreciate any advice, best practices, or pointers on 
how to achieve this integration effectively. Specifically, I'm looking 
for insights on:


 1. Configuring Spark to access and retrieve the JAR file from an AWS
S3 bucket.
 2. Setting up the necessary permissions and authentication mechanisms
to ensure seamless access to the S3 bucket.
 3. Any potential performance considerations or optimizations when
running Spark applications with dependencies stored in remote
storage like AWS S3.

If anyone in the community has prior experience or knowledge in this 
area, I would be extremely grateful for your guidance. Additionally, 
if there are any relevant resources, documentation, or tutorials that 
you could recommend, it would be incredibly helpful.


Thank you very much for considering my request. I look forward to 
hearing from you and benefiting from the collective expertise of the 
Spark community.


Best regards, Jagannath Majhi


Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Dear Spark Community,

I hope this email finds you well. I am reaching out to seek assistance and
guidance regarding a task I'm currently working on involving Apache Spark.

I have developed a JAR file that contains some Spark applications and
functionality, and I need to run this JAR file within a Spark cluster.
However, the JAR file is located in an AWS S3 bucket. I'm facing some
challenges in configuring Spark to access and execute this JAR file
directly from the S3 bucket.

I would greatly appreciate any advice, best practices, or pointers on how
to achieve this integration effectively. Specifically, I'm looking for
insights on:

   1. Configuring Spark to access and retrieve the JAR file from an AWS S3
   bucket.
   2. Setting up the necessary permissions and authentication mechanisms to
   ensure seamless access to the S3 bucket.
   3. Any potential performance considerations or optimizations when
   running Spark applications with dependencies stored in remote storage like
   AWS S3.

If anyone in the community has prior experience or knowledge in this area,
I would be extremely grateful for your guidance. Additionally, if there are
any relevant resources, documentation, or tutorials that you could
recommend, it would be incredibly helpful.

Thank you very much for considering my request. I look forward to hearing
from you and benefiting from the collective expertise of the Spark
community.

Best regards, Jagannath Majhi


Elasticity and scalability for Spark in Kubernetes

2023-10-30 Thread Mich Talebzadeh
I was thinking along the lines of elasticity and autoscaling for Spark in
the context of Kubernetes. My experience with Kubernetes and Spark on the
so-called autopilot has not been that great. This is mainly because in
autopilot you let the choice of nodes be decided by the vendor's default
configuration. Autopilot assumes that you can scale horizontally if resource
allocation is not there. However, this does not take into account that if
you start a k8s node of 4GB, which is totally inadequate for a Spark job
with moderate loads, the driver pod simply fails to create and autopilot
starts building the cluster again, causing delay. Sure, it can start with a
larger node size and it might get there eventually, at a considerable delay.

Vertical elasticity refers to the ability of a single application instance
to scale its resources up or down. This can be done by adjusting the amount
of memory, CPU, or storage allocated to the application.

Horizontal autoscaling refers to the ability to automatically add or remove
application instances based on the workload. This is typically done by
monitoring the application's performance metrics, such as CPU utilization,
memory usage, or request latency.

Vertical elasticity


   - Memory: The amount of memory allocated to each Spark executor.
   - CPU: The number of CPU cores allocated to each Spark executor.
   - Storage: The amount of storage allocated to each Spark executor.


Horizontal autoscaling


   - Minimum number of executors: The minimum number of executors that
   should be running at any given time.
   - Maximum number of executors: The maximum number of executors that can
   be running at any given time.
   - Target CPU utilization: The desired CPU utilization for the cluster.
   - Target memory utilization: The desired memory utilization for the
   cluster.
   - Target request latency: The desired request latency for the
   application.


For example, in Python I would have these:


# Setting the horizontal autoscaling parameters

spark.conf.set('spark.dynamicAllocation.enabled', 'true')
spark.conf.set('spark.dynamicAllocation.minExecutors', min_instances)
spark.conf.set('spark.dynamicAllocation.maxExecutors', max_instances)
spark.conf.set('spark.dynamicAllocation.targetExecutorIdleTime', 30)
spark.conf.set('spark.dynamicAllocation.initialExecutors', 4)
spark.conf.set('spark.dynamicAllocation.targetRequestLatency', 100)

I have also set the following properties, which are not strictly
necessary for horizontal autoscaling, but which can be helpful:

   - target_memory_utilization: This property specifies the desired memory
   utilization for the application cluster.
   - target_request_latency: This property specifies the desired request
   latency for the application cluster.


spark.conf.set('target_request_latency', 100)
spark.conf.set('target_memory_utilization', 60)

Anyway, this is a sample of the parameters that I use in spark-submit:

spark-submit --verbose \
   --properties-file ${property_file} \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name $APPNAME \
   --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
   --conf spark.kubernetes.namespace=$NAMESPACE \
   --conf spark.network.timeout=300 \
   --conf spark.kubernetes.allocation.batch.size=3 \
   --conf spark.kubernetes.allocation.batch.delay=1 \
   --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.driver.pod.name=$APPNAME \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
   --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
   --conf spark.dynamicAllocation.executorIdleTimeout=30s \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
   --conf spark.dynamicAllocation.minExecutors=0 \
   --conf spark.dynamicAllocation.maxExecutors=20 \
   --conf spark.driver.cores=3 \
   --conf spark.executor.cores=3 \
   --conf spark.driver.memory=1024m \
   --conf spark.executor.memory=1024m \
   $CODE_DIRECTORY_CLOUD/${APPLICATION}

Note that I have kept the memory low (both the driver and executor) to move
the submitted job from the Pending to the Running state. This is by no means
optimal, but I would like to explore ideas on it with the other members.

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

   view my Linkedin profile

 http

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members,

Thank you profoundly for your initial insights. I feel it's necessary to
provide more precision on our setup to facilitate a deeper understanding.

We're interfacing with S3 Compatible storages, but our operational context
is somewhat distinct. Our infrastructure doesn't lean on conventional cloud
providers like AWS. Instead, we've architected our environment on
On-Premise Kubernetes distributions, specifically k0s and Openshift.

Our objective extends beyond just handling S3 keys. We're orchestrating a
solution that integrates Azure SPNs, API Credentials, and other sensitive
credentials, intending to make Kubernetes' native secrets our central
management hub. The aspiration is to have a universally deployable JAR, one
that can function unmodified across different ecosystems like EMR,
Databricks (on both AWS and Azure), etc. Platforms like Databricks have
already made strides in this direction, allowing secrets to be woven
directly into the Spark Conf through mechanisms like
{{secret_scope/secret_name}}, which are resolved dynamically.

The spark-on-k8s-operator's user guide suggests the feasibility of mounting
secrets. However, a gap exists in our understanding of how to subsequently
access these mounted secret values within the Spark application's context.
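
For reference, my current understanding is that with plain spark-submit the
built-in secret support looks roughly like the sketch below (the secret name
aws-secrets, the key names, the mount path and the image are placeholders
for illustration); the secret then shows up either as files under the mount
path or as injected environment variables that the application reads:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name secrets-demo \
   --conf spark.kubernetes.container.image=my-spark-image:latest \
   --conf spark.kubernetes.driver.secrets.aws-secrets=/etc/secrets \
   --conf spark.kubernetes.executor.secrets.aws-secrets=/etc/secrets \
   --conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secrets:access-key \
   --conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secrets:secret-key \
   --conf spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secrets:access-key \
   --conf spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secrets:secret-key \
   local:///opt/spark/jars/secrets-demo.jar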

Here lies my inquiry: is the spark-on-k8s-operator currently equipped to
support this level of integration? If it does, any elucidation on the
method or best practices would be pivotal for our project. Alternatively,
if you could point me to resources or community experts who have tackled
similar challenges, it would be of immense assistance.

Thank you for bearing with the intricacies of our query, and I appreciate
your continued guidance in this endeavor.

Warm regards,

Jon Rodríguez Aranguren.

On Sat, 30 Sept 2023 at 23:19, Jayabindu Singh ()
wrote:

> Hi Jon,
>
> Using IAM as suggested by Jorn is the best approach.
> We recently moved our spark workload from HDP to Spark on K8 and utilizing
> IAM.
> It will save you from secret management headaches and also allows a lot
> more flexibility on access control and option to allow access to multiple
> S3 buckets in the same pod.
> We have implemented this across Azure, Google and AWS. Azure does require
> some extra work to make it work.
>
> On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke  wrote:
>
>> Don’t use static iam (s3) credentials. It is an outdated insecure method
>> - even AWS recommend against using this for anything (cf eg
>> https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html
>> ).
>> It is almost a guarantee to get your data stolen and your account
>> manipulated.
>>
>> If you need to use kubernetes (which has its own very problematic
>> security issues) then assign AWS IAM roles with minimal permissions to the
>> pods (for EKS it means using OIDC, cf
>> https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).
>>
>> Am 30.09.2023 um 03:41 schrieb Jon Rodríguez Aranguren <
>> jon.r.arangu...@gmail.com>:
>>
>> 
>> Dear Spark Community Members,
>>
>> I trust this message finds you all in good health and spirits.
>>
>> I'm reaching out to the collective expertise of this esteemed community
>> with a query regarding Spark on Kubernetes. As a newcomer, I have always
>> admired the depth and breadth of knowledge shared within this forum, and it
>> is my hope that some of you might have insights on a specific challenge I'm
>> facing.
>>
>> I am currently trying to configure multiple Kubernetes secrets, notably
>> multiple S3 keys, at the SparkConf level for a Spark application. My
>> objective is to understand the best approach or methods to ensure that
>> these secrets can be smoothly accessed by the Spark application.
>>
>> If any of you have previously encountered this scenario or possess
>> relevant insights on the matter, your guidance would be highly beneficial.
>>
>> Thank you for your time and consideration. I'm eager to learn from the
>> experiences and knowledge present within this community.
>>
>> Warm regards,
>> Jon
>>
>>


Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
There is nowadays more of a trend to move away from static
credentials/certificates that are stored in a secret vault. The issue is
that their rotation is complex, once they are leaked they can be abused,
and making minimal permissions feasible is cumbersome, etc. That is why
keyless approaches are used for A2A access (workload identity federation
was mentioned). E.g. in AWS EKS you would build this on OIDC
(https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html)
and configure this instead of using secrets. Similar approaches exist in
other clouds and even on-premise (e.g. SPIFFE, https://spiffe.io/).

Whether this will become the standard is difficult to say - for sure they
seem to be easier to manage. Since you seem to have a Kubernetes setup -
which means, per cloud/data centre, a lot of extra work, infrastructure
cost and security issues - workload identity federation may ease this
compared to a secret store.

On 01.10.2023 at 08:27, Jon Rodríguez Aranguren wrote:

> Dear Jörn Franke, Jayabindu Singh and Spark Community members,
>
> Thank you profoundly for your initial insights. I feel it's necessary to
> provide more precision on our setup to facilitate a deeper understanding.
>
> We're interfacing with S3 Compatible storages, but our operational context
> is somewhat distinct. Our infrastructure doesn't lean on conventional
> cloud providers like AWS. Instead, we've architected our environment on
> On-Premise Kubernetes distributions, specifically k0s and Openshift.
>
> Our objective extends beyond just handling S3 keys. We're orchestrating a
> solution that integrates Azure SPNs, API Credentials, and other sensitive
> credentials, intending to make Kubernetes' native secrets our central
> management hub. The aspiration is to have a universally deployable JAR,
> one that can function unmodified across different ecosystems like EMR,
> Databricks (on both AWS and Azure), etc. Platforms like Databricks have
> already made strides in this direction, allowing secrets to be woven
> directly into the Spark Conf through mechanisms like
> {{secret_scope/secret_name}}, which are resolved dynamically.
>
> The spark-on-k8s-operator's user guide suggests the feasibility of
> mounting secrets. However, a gap exists in our understanding of how to
> subsequently access these mounted secret values within the Spark
> application's context.
>
> Here lies my inquiry: is the spark-on-k8s-operator currently equipped to
> support this level of integration? If it does, any elucidation on the
> method or best practices would be pivotal for our project. Alternatively,
> if you could point me to resources or community experts who have tackled
> similar challenges, it would be of immense assistance.
>
> Thank you for bearing with the intricacies of our query, and I appreciate
> your continued guidance in this endeavor.
>
> Warm regards,
> Jon Rodríguez Aranguren.
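
To make the EKS/OIDC route above concrete, here is a minimal sketch combining
IAM Roles for Service Accounts (IRSA) with Spark on Kubernetes; the cluster
name, namespace, service account, policy ARN and application are hypothetical
placeholders rather than values from this thread:

```
# Bind a Kubernetes service account to an IAM role through the cluster's
# OIDC provider; eksctl creates the role, the trust policy and the
# eks.amazonaws.com/role-arn annotation on the service account.
eksctl create iamserviceaccount \
  --cluster my-eks-cluster \
  --namespace spark \
  --name spark-sa \
  --attach-policy-arn arn:aws:iam::123456789012:policy/spark-s3-access \
  --approve

# Run driver and executors under that service account so the pods obtain
# short-lived credentials from the projected web identity token instead of
# static keys stored in secrets.
spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  local:///opt/spark/examples/src/main/python/pi.py
```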




Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
With OIDC something comparable is possible:
https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html

On 01.10.2023 at 11:13, Mich Talebzadeh wrote:

> It seems that workload identity is not available on AWS. Workload
> Identity replaces the need to use Metadata concealment on exposed storage
> such as s3 and gcs. The sensitive metadata protected by metadata
> concealment is also protected by Workload Identity.
>
> Both Google Cloud Kubernetes (GKE) and Azure Kubernetes Service support
> Workload Identity. Taking notes from Google Cloud: "Workload Identity is
> the recommended way for your workloads running on Google Kubernetes
> Engine (GKE) to access Google Cloud services in a secure and manageable
> way."
>
> HTH




Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
It seems that workload identity
<https://cloud.google.com/iam/docs/workload-identity-federation> is not
available on AWS. Workload Identity replaces the need to use Metadata
concealment on exposed storage such as s3 and gcs. The sensitive metadata
protected by metadata concealment is also protected by Workload Identity.

Both Google Cloud Kubernetes (GKE
<https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>)
and Azure Kubernetes Service
<https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview>
support Workload Identity. Taking notes from Google Cloud:  "Workload
Identity is the recommended way for your workloads running on Google
Kubernetes Engine (GKE) to access Google Cloud services in a secure and
manageable way."


HTH


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 1 Oct 2023 at 06:36, Jayabindu Singh  wrote:

> Hi Jon,
>
> Using IAM as suggested by Jorn is the best approach.
> We recently moved our spark workload from HDP to Spark on K8 and utilizing
> IAM.
> It will save you from secret management headaches and also allows a lot
> more flexibility on access control and option to allow access to multiple
> S3 buckets in the same pod.
> We have implemented this across Azure, Google and AWS. Azure does require
> some extra work to make it work.
>
> On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke  wrote:
>
>> Don’t use static iam (s3) credentials. It is an outdated insecure method
>> - even AWS recommend against using this for anything (cf eg
>> https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html
>> ).
>> It is almost a guarantee to get your data stolen and your account
>> manipulated.
>>
>> If you need to use kubernetes (which has its own very problematic
>> security issues) then assign AWS IAM roles with minimal permissions to the
>> pods (for EKS it means using OIDC, cf
>> https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).
>>
>> On 30.09.2023 at 03:41, Jon Rodríguez Aranguren <
>> jon.r.arangu...@gmail.com> wrote:
>>
>> 
>> Dear Spark Community Members,
>>
>> I trust this message finds you all in good health and spirits.
>>
>> I'm reaching out to the collective expertise of this esteemed community
>> with a query regarding Spark on Kubernetes. As a newcomer, I have always
>> admired the depth and breadth of knowledge shared within this forum, and it
>> is my hope that some of you might have insights on a specific challenge I'm
>> facing.
>>
>> I am currently trying to configure multiple Kubernetes secrets, notably
>> multiple S3 keys, at the SparkConf level for a Spark application. My
>> objective is to understand the best approach or methods to ensure that
>> these secrets can be smoothly accessed by the Spark application.
>>
>> If any of you have previously encountered this scenario or possess
>> relevant insights on the matter, your guidance would be highly beneficial.
>>
>> Thank you for your time and consideration. I'm eager to learn from the
>> experiences and knowledge present within this community.
>>
>> Warm regards,
>> Jon
>>
>>


Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon,

Using IAM as suggested by Jorn is the best approach.
We recently moved our spark workload from HDP to Spark on K8 and utilizing
IAM.
It will save you from secret management headaches and also allows a lot
more flexibility on access control and option to allow access to multiple
S3 buckets in the same pod.
We have implemented this across Azure, Google and AWS. Azure does require
some extra work to make it work.

On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke  wrote:

> Don’t use static iam (s3) credentials. It is an outdated insecure method -
> even AWS recommend against using this for anything (cf eg
> https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html
> ).
> It is almost a guarantee to get your data stolen and your account
> manipulated.
>
> If you need to use kubernetes (which has its own very problematic security
> issues) then assign AWS IAM roles with minimal permissions to the pods (for
> EKS it means using OIDC, cf
> https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).
>
> On 30.09.2023 at 03:41, Jon Rodríguez Aranguren <
> jon.r.arangu...@gmail.com> wrote:
>
> 
> Dear Spark Community Members,
>
> I trust this message finds you all in good health and spirits.
>
> I'm reaching out to the collective expertise of this esteemed community
> with a query regarding Spark on Kubernetes. As a newcomer, I have always
> admired the depth and breadth of knowledge shared within this forum, and it
> is my hope that some of you might have insights on a specific challenge I'm
> facing.
>
> I am currently trying to configure multiple Kubernetes secrets, notably
> multiple S3 keys, at the SparkConf level for a Spark application. My
> objective is to understand the best approach or methods to ensure that
> these secrets can be smoothly accessed by the Spark application.
>
> If any of you have previously encountered this scenario or possess
> relevant insights on the matter, your guidance would be highly beneficial.
>
> Thank you for your time and consideration. I'm eager to learn from the
> experiences and knowledge present within this community.
>
> Warm regards,
> Jon
>
>


Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static iam (s3) credentials. It is an outdated insecure method - even 
AWS recommend against using this for anything (cf eg 
https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html).
It is almost a guarantee to get your data stolen and your account manipulated. 

If you need to use kubernetes (which has its own very problematic security 
issues) then assign AWS IAM roles with minimal permissions to the pods (for EKS 
it means using OIDC, cf 
https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).

> On 30.09.2023 at 03:41, Jon Rodríguez Aranguren
> wrote:
> 
> 
> Dear Spark Community Members,
> 
> I trust this message finds you all in good health and spirits.
> 
> I'm reaching out to the collective expertise of this esteemed community with 
> a query regarding Spark on Kubernetes. As a newcomer, I have always admired 
> the depth and breadth of knowledge shared within this forum, and it is my 
> hope that some of you might have insights on a specific challenge I'm facing.
> 
> I am currently trying to configure multiple Kubernetes secrets, notably 
> multiple S3 keys, at the SparkConf level for a Spark application. My 
> objective is to understand the best approach or methods to ensure that these 
> secrets can be smoothly accessed by the Spark application.
> 
> If any of you have previously encountered this scenario or possess relevant 
> insights on the matter, your guidance would be highly beneficial.
> 
> Thank you for your time and consideration. I'm eager to learn from the 
> experiences and knowledge present within this community.
> 
> Warm regards,
> Jon


Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
Dear Spark Community Members,

I trust this message finds you all in good health and spirits.

I'm reaching out to the collective expertise of this esteemed community
with a query regarding Spark on Kubernetes. As a newcomer, I have always
admired the depth and breadth of knowledge shared within this forum, and it
is my hope that some of you might have insights on a specific challenge I'm
facing.

I am currently trying to configure multiple Kubernetes secrets, notably
multiple S3 keys, at the SparkConf level for a Spark application. My
objective is to understand the best approach or methods to ensure that
these secrets can be smoothly accessed by the Spark application.

If any of you have previously encountered this scenario or possess relevant
insights on the matter, your guidance would be highly beneficial.

Thank you for your time and consideration. I'm eager to learn from the
experiences and knowledge present within this community.

Warm regards,
Jon
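
For reference, Spark's own Kubernetes backend can mount secrets as files or
inject individual keys as environment variables directly from the submission
configuration. A minimal sketch follows; the secret names, keys and mount
paths are hypothetical, and the replies in this thread recommend preferring
workload identity over long-lived keys where possible:

```
# Create the secrets in the application's namespace (example names).
kubectl -n spark create secret generic s3-keys-a \
  --from-literal=AWS_ACCESS_KEY_ID=AKIA... \
  --from-literal=AWS_SECRET_ACCESS_KEY=...
kubectl -n spark create secret generic s3-keys-b \
  --from-file=credentials=./creds-b.properties

# spark.kubernetes.{driver,executor}.secrets.<secret>=<path> mounts a whole
# secret as files under <path>;
# spark.kubernetes.{driver,executor}.secretKeyRef.<ENV>=<secret>:<key>
# injects a single key as an environment variable.
spark-submit \
  --conf spark.kubernetes.driver.secrets.s3-keys-b=/mnt/secrets/s3-b \
  --conf spark.kubernetes.executor.secrets.s3-keys-b=/mnt/secrets/s3-b \
  --conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=s3-keys-a:AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=s3-keys-a:AWS_SECRET_ACCESS_KEY \
  --conf spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=s3-keys-a:AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=s3-keys-a:AWS_SECRET_ACCESS_KEY \
  ...
```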


SPIP: Adding work load identity to Spark on Kubernetes documents (supersedes Secret Management)

2023-02-20 Thread Mich Talebzadeh
Hi,

I would like to propose that the current Secret Management
<https://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management>
in
Spark Kubernetes documentation to include the more secure credentials
Workload identity) for Spark to access Kubernetes services.


Both Google Cloud Kubernetes (GKE
<https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>)
and Azure Kubernetes Service
<https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview>
support Workload Identity.


Taking notes from Google Cloud "Workload Identity is the recommended way
for your workloads running on Google Kubernetes Engine (GKE) to access
Google Cloud services in a secure and manageable way."


Workload Identity replaces the need to use Metadata concealment. The
sensitive metadata protected by metadata concealment is also protected by
Workload Identity.


In the usual way, we had secret management that required the secret to be
put on a shared drive that the nodes of the K8s cluster could access. This
was normally on Cloud Storage and exposed the following:


kubectl create secret generic spark-sa  --namespace=spark
--from-file=./spark-sa.json


that spark-sa.json file contained the following:


{
  "type": "service_account",
  "project_id": "",
  "private_key_id": "7a0d67d19c5d74337792c2320d698085e9",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQDTsqxsyqDP4ViY\nhO0y7INv+tr8pKEz630DOkjI/kzKfvYelzlrjZ+/EAkqOymCzIIF1LsRG8y//G3/\nzUGR2tcUKbeEaeaJJtG3tGJfCnEoApL3+jA7OvNEbJoeFsMgZ82cDXeZtYdmPdX0\nd1gwpb1yrzBckecsuG0yHs0biz9pwR7xvIPjEo26AcrFvQeOLY2P60UM40AED0F+\n23QtlsXBTjMaWih020fWNlVJSaA+FkVGfSMgQ233/5qeVeLOIBJ9BDgxf4M9OYZO\n ... \n-----END PRIVATE KEY-----\n",
  "client_email": "spark-bq@.iam.gserviceaccount.com",
  "client_id": "10032476552331",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/spark-bq%.iam.gserviceaccount.com"
}


Cloud service account keys do not expire and require manual rotation.
Exporting service account keys has the potential to expand the scope of a
security breach if it goes undetected. If an exported key is stolen, an
attacker can use it to authenticate as that service account until this is
noticed and the key is manually revoked. The mounted key file also contains
a lot of sensitive material that can be read from the mount directory.


Let me know your thoughts


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
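
As an illustration of the Workload Identity alternative to the mounted key
file above, a minimal GKE sketch; the project, namespace and account names
are hypothetical, and Workload Identity must already be enabled on the
cluster and node pool:

```
# Google service account (GSA) that carries the actual IAM permissions.
gcloud iam service-accounts create spark-gsa --project=my-project

# Let the Kubernetes service account (KSA) spark/spark-sa impersonate the GSA.
gcloud iam service-accounts add-iam-policy-binding \
  spark-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[spark/spark-sa]"

# Annotate the KSA so GKE injects federated credentials instead of a key file.
kubectl annotate serviceaccount spark-sa --namespace spark \
  iam.gke.io/gcp-service-account=spark-gsa@my-project.iam.gserviceaccount.com

# Driver and executors then run under the KSA, with no spark-sa.json to mount:
#   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa
```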


Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-16 Thread Mich Talebzadeh
You can try this

gsutil cp src/StructuredStream-on-gke.py gs://codes/

where you create a bucket on gcs called codes


Then in your spark-submit do:


spark-submit --verbose \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name  \
   --conf spark.kubernetes.driver.container.image=pyspark-example:0.1 \
   --conf spark.kubernetes.executor.container.image=pyspark-example:0.1 \
   gs://codes/StructuredStream-on-gke.py



HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Feb 2023 at 21:17, karan alang  wrote:

> thnks, Mich .. let me check this
>
>
>
> On Wed, Feb 15, 2023 at 1:42 AM Mich Talebzadeh 
> wrote:
>
>>
>> It may help to check this article of mine
>>
>>
>> Spark on Kubernetes, A Practitioner’s Guide
>> <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/?trackingId=FDQORri0TBeJl02p3D%2B2JA%3D%3D>
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Feb 2023 at 09:12, Mich Talebzadeh 
>> wrote:
>>
>>> Your submit command
>>>
>>> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
>>> cluster --name pyspark-example --conf 
>>> spark.kubernetes.container.image=pyspark-example:0.1
>>> --conf spark.kubernetes.file.upload.path=/myexample
>>> src/StructuredStream-on-gke.py
>>>
>>>
>>> pay attention to what it says
>>>
>>>
>>> --conf spark.kubernetes.file.upload.path
>>>
>>> That refers to your Python package on GCS storage not in the docker
>>> itself
>>>
>>>
>>> From
>>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>>>
>>>
>>> "... The app jar file will be uploaded to the S3 and then when the
>>> driver is launched it will be downloaded to the driver pod and will be
>>> added to its classpath. Spark will generate a subdir under the upload path
>>> with a random name to avoid conflicts with spark apps running in parallel.
>>> User could manage the subdirs created according to his needs..."
>>>
>>>
>>> In your case it is gs not s3
>>>
>>>
>>> There is no point putting your python file in the docker image itself!
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:
>>>
>>>> Hi Ye,
>>>>
>>>> This is the error i get when i don't set the
>>>> spark.kubernetes.file.upload.path
>>>>
>>>> Any ideas on how to fix this ?
>>>>
>>>> ```
>>>>
>>>> Exception in thread "main" org.apache.spark.SparkException: Please
>>>> specify spark.kubernetes.file.upload.path property.
>>>>
>>>> at
>>>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(Kuber

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread karan alang
thnks, Mich .. let me check this



On Wed, Feb 15, 2023 at 1:42 AM Mich Talebzadeh 
wrote:

>
> It may help to check this article of mine
>
>
> Spark on Kubernetes, A Practitioner’s Guide
> <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/?trackingId=FDQORri0TBeJl02p3D%2B2JA%3D%3D>
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Feb 2023 at 09:12, Mich Talebzadeh 
> wrote:
>
>> Your submit command
>>
>> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
>> cluster --name pyspark-example --conf 
>> spark.kubernetes.container.image=pyspark-example:0.1
>> --conf spark.kubernetes.file.upload.path=/myexample
>> src/StructuredStream-on-gke.py
>>
>>
>> pay attention to what it says
>>
>>
>> --conf spark.kubernetes.file.upload.path
>>
>> That refers to your Python package on GCS storage not in the docker itself
>>
>>
>> From
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>>
>>
>> "... The app jar file will be uploaded to the S3 and then when the
>> driver is launched it will be downloaded to the driver pod and will be
>> added to its classpath. Spark will generate a subdir under the upload path
>> with a random name to avoid conflicts with spark apps running in parallel.
>> User could manage the subdirs created according to his needs..."
>>
>>
>> In your case it is gs not s3
>>
>>
>> There is no point putting your python file in the docker image itself!
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:
>>
>>> Hi Ye,
>>>
>>> This is the error i get when i don't set the
>>> spark.kubernetes.file.upload.path
>>>
>>> Any ideas on how to fix this ?
>>>
>>> ```
>>>
>>> Exception in thread "main" org.apache.spark.SparkException: Please
>>> specify spark.kubernetes.file.upload.path property.
>>>
>>> at
>>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
>>>
>>> at
>>> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>>>
>>> at
>>> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>>>
>>> at
>>> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>>>
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>>>
>>> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>>>
>>> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>>>
>>> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
>>>
>>> at scala.collection.immutable.List.foreach(List.scala:392)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
>>&

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread Mich Talebzadeh
It may help to check this article of mine


Spark on Kubernetes, A Practitioner’s Guide
<https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/?trackingId=FDQORri0TBeJl02p3D%2B2JA%3D%3D>


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Feb 2023 at 09:12, Mich Talebzadeh 
wrote:

> Your submit command
>
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
> cluster --name pyspark-example --conf 
> spark.kubernetes.container.image=pyspark-example:0.1
> --conf spark.kubernetes.file.upload.path=/myexample
> src/StructuredStream-on-gke.py
>
>
> pay attention to what it says
>
>
> --conf spark.kubernetes.file.upload.path
>
> That refers to your Python package on GCS storage not in the docker itself
>
>
> From
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>
>
> "... The app jar file will be uploaded to the S3 and then when the driver
> is launched it will be downloaded to the driver pod and will be added to
> its classpath. Spark will generate a subdir under the upload path with a
> random name to avoid conflicts with spark apps running in parallel. User
> could manage the subdirs created according to his needs..."
>
>
> In your case it is gs not s3
>
>
> There is no point putting your python file in the docker image itself!
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:
>
>> Hi Ye,
>>
>> This is the error i get when i don't set the
>> spark.kubernetes.file.upload.path
>>
>> Any ideas on how to fix this ?
>>
>> ```
>>
>> Exception in thread "main" org.apache.spark.SparkException: Please
>> specify spark.kubernetes.file.upload.path property.
>>
>> at
>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
>>
>> at
>> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
>>
>> at
>> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>>
>> at
>> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>>
>> at
>> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>>
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>>
>> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>>
>> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>>
>> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>>
>> at
>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
>>
>> at
>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
>>
>> at scala.collection.immutable.List.foreach(List.scala:392)
>>
>> at
>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
>>
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
>>
>> at
>> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>>
>> at
>> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>>
>> at scala.collection.immutable.List.foldLeft(List.scala:89)
>>
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
>>
>> at
>> org.apache.spark.deploy.k8s.submit.Client.r

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread Mich Talebzadeh
Your submit command

spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster
--name pyspark-example --conf
spark.kubernetes.container.image=pyspark-example:0.1
--conf spark.kubernetes.file.upload.path=/myexample
src/StructuredStream-on-gke.py


pay attention to what it says


--conf spark.kubernetes.file.upload.path

That refers to your Python package on GCS storage not in the docker itself


From
https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management


"... The app jar file will be uploaded to the S3 and then when the driver
is launched it will be downloaded to the driver pod and will be added to
its classpath. Spark will generate a subdir under the upload path with a
random name to avoid conflicts with spark apps running in parallel. User
could manage the subdirs created according to his needs..."


In your case it is gs not s3


There is no point putting your python file in the docker image itself!


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:

> Hi Ye,
>
> This is the error i get when i don't set the
> spark.kubernetes.file.upload.path
>
> Any ideas on how to fix this ?
>
> ```
>
> Exception in thread "main" org.apache.spark.SparkException: Please specify
> spark.kubernetes.file.upload.path property.
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
>
> at
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>
> at
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
>
> at
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
>
> at scala.collection.immutable.List.foreach(List.scala:392)
>
> at
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>
> at scala.collection.immutable.List.foldLeft(List.scala:89)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
>
> at
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
>
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
>
> at org.apache.spark.deploy.SparkSubmit.org
> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
>
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> ```
>
> On Tue, Feb 14, 2023 at 1:33 AM Ye Xianjin  wrote:
>
>> The configuration of ‘…file.upload.path’ is wrong. it means a distributed
>> fs path to store your archives/resource/jars temporarily, then distributed
>> by spark to drivers/executors.
>> For your cases, you don’t need to set this configuration.
>> Sent from my 

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread karan alang
Hi Ye,

This is the error i get when i don't set the
spark.kubernetes.file.upload.path

Any ideas on how to fix this ?

```

Exception in thread "main" org.apache.spark.SparkException: Please specify
spark.kubernetes.file.upload.path property.

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)

at
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)

at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)

at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)

at scala.collection.TraversableLike.map(TraversableLike.scala:238)

at scala.collection.TraversableLike.map$(TraversableLike.scala:231)

at scala.collection.AbstractTraversable.map(Traversable.scala:108)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)

at
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)

at scala.collection.immutable.List.foreach(List.scala:392)

at
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)

at
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)

at
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)

at scala.collection.immutable.List.foldLeft(List.scala:89)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)

at
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)

at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)

at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)

at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)

at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)

at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

On Tue, Feb 14, 2023 at 1:33 AM Ye Xianjin  wrote:

> The configuration of ‘…file.upload.path’ is wrong. it means a distributed
> fs path to store your archives/resource/jars temporarily, then distributed
> by spark to drivers/executors.
> For your cases, you don’t need to set this configuration.
> Sent from my iPhone
>
> On Feb 14, 2023, at 5:43 AM, karan alang  wrote:
>
> 
> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing:
> Note : I have spark(bitnami spark chart) installed on GKE using helm
> install
>
> Here is what is done :
> 1. created a docker image using Dockerfile
>
> Dockerfile :
> ```
>
> FROM python:3.7-slim
>
> RUN apt-get update && \
> apt-get install -y default-jre && \
> apt-get install -y openjdk-11-jre-headless && \
> apt-get clean
>
> ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
>
> RUN pip install pyspark
> RUN mkdir -p /myexample && chmod 755 /myexample
> WORKDIR /myexample
>
> COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
>
> CMD ["pyspark"]
>
> ```
> Simple pyspark application :
> ```
>
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
>
> data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
> df = spark.createDataFrame(data, ('id', 'salary'))
>
> df.show(5, False)
>
> ```
>
> Spark-submit command :
> ```
>
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
> cluster --name pyspark-example --conf
> spark.kubernetes.container.image=pyspark-example:0.1 --conf
> spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
> ```
>
> Error i get :
> ```
>
> 23/02/13 13:18:27 INFO KubernetesUtils: Uploading file:
> /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
> to dest:
> /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
>
> Exception in 

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Ye Xianjin
The configuration of '…file.upload.path' is wrong. It means a distributed fs
path to store your archives/resources/jars temporarily, which are then
distributed by spark to drivers/executors. For your case, you don't need to
set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang  wrote:

Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is
failing:
Note : I have spark(bitnami spark chart) installed on GKE using helm install

Here is what is done :
1. created a docker image using Dockerfile

Dockerfile :
```
FROM python:3.7-slim

RUN apt-get update && \
apt-get install -y default-jre && \
apt-get install -y openjdk-11-jre-headless && \
apt-get clean

ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample

COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py

CMD ["pyspark"]
```

Simple pyspark application :
```
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()

data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))

df.show(5, False)
```

Spark-submit command :
```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster
--name pyspark-example --conf
spark.kubernetes.container.image=pyspark-example:0.1 --conf
spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error i get :
```





23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
	... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
	at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
	at 

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Khalid Mammadov
I am not a k8s expert, but I think you have a permission issue. Try 777 as
an example to see if it works.

On Mon, 13 Feb 2023, 21:42 karan alang,  wrote:

> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing:
> Note : I have spark(bitnami spark chart) installed on GKE using helm
> install
>
> Here is what is done :
> 1. created a docker image using Dockerfile
>
> Dockerfile :
> ```
>
> FROM python:3.7-slim
>
> RUN apt-get update && \
> apt-get install -y default-jre && \
> apt-get install -y openjdk-11-jre-headless && \
> apt-get clean
>
> ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
>
> RUN pip install pyspark
> RUN mkdir -p /myexample && chmod 755 /myexample
> WORKDIR /myexample
>
> COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
>
> CMD ["pyspark"]
>
> ```
> Simple pyspark application :
> ```
>
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
>
> data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
> df = spark.createDataFrame(data, ('id', 'salary'))
>
> df.show(5, False)
>
> ```
>
> Spark-submit command :
> ```
>
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
> cluster --name pyspark-example --conf
> spark.kubernetes.container.image=pyspark-example:0.1 --conf
> spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
> ```
>
> Error i get :
> ```
>
> 23/02/13 13:18:27 INFO KubernetesUtils: Uploading file:
> /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
> to dest:
> /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
>
> Exception in thread "main" org.apache.spark.SparkException: Uploading file
> /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
> failed...
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
>
> at
> org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
>
> at
> org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>
> at scala.collection.immutable.List.foldLeft(List.scala:89)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
>
> at
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
>
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
>
> at org.apache.spark.deploy.SparkSubmit.org
> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
>
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: org.apache.spark.SparkException: Error uploading file
> StructuredStream-on-gke.py
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
>
> ... 21 more
>
> Caused by: java.io.IOException: Mkdirs failed to create
> /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
>
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
>
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
>
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
>
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
>
> at 

Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-13 Thread karan alang
Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is
failing:
Note : I have spark(bitnami spark chart) installed on GKE using helm
install

Here is what is done :
1. created a docker image using Dockerfile

Dockerfile :
```

FROM python:3.7-slim

RUN apt-get update && \
apt-get install -y default-jre && \
apt-get install -y openjdk-11-jre-headless && \
apt-get clean

ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample

COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py

CMD ["pyspark"]

```
Simple pyspark application :
```

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()

data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))

df.show(5, False)

```

Spark-submit command :
```

spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster
--name pyspark-example --conf
spark.kubernetes.container.image=pyspark-example:0.1 --conf
spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error i get :
```

23/02/13 13:18:27 INFO KubernetesUtils: Uploading file:
/Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
to dest:
/myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...

Exception in thread "main" org.apache.spark.SparkException: Uploading file
/Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
failed...

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)

at
org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)

at
org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)

at
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)

at
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)

at scala.collection.immutable.List.foldLeft(List.scala:89)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)

at
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)

at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)

at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)

at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)

at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)

at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: org.apache.spark.SparkException: Error uploading file
StructuredStream-on-gke.py

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)

... 21 more

Caused by: java.io.IOException: Mkdirs failed to create
/myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a

at
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)

at
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)

at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)

at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)

at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)

at
org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:316)

... 22 more
```

Any ideas on how to fix this & get it to work ?
tia !

Pls see the stackoverflow link :

https://stackoverflow.com/questions/75441360/running-spark-application-on-gke-failing-on-spark-submit
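
Putting the advice from the replies together, a corrected submit for this
example could look roughly as follows; the staging bucket is a hypothetical
name, and the GCS connector must be available to the spark-submit client for
the gs:// upload path to work:

```
# Stage the application on a distributed filesystem the driver pod can read,
# instead of a local directory baked into the image.
gsutil mb gs://my-spark-staging        # one-off: create a staging bucket

spark-submit \
  --master k8s://https://<kubernetes-api-server>:443 \
  --deploy-mode cluster \
  --name pyspark-example \
  --conf spark.kubernetes.container.image=pyspark-example:0.1 \
  --conf spark.kubernetes.file.upload.path=gs://my-spark-staging \
  src/StructuredStream-on-gke.py
```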


Re: spark on kubernetes

2022-10-16 Thread Qian Sun
Glad to hear it!

On Sun, Oct 16, 2022 at 2:37 PM Mohammad Abdollahzade Arani <
mamadazar...@gmail.com> wrote:

> Hi Qian,
> Thanks for the reply and I'm So sorry for the late reply.
> I found the answer. My mistake was token conversion. I had to decode
> base64  the service accounts token and certificate.
> and you are right I have to use `service account cert` to configure
> spark.kubernetes.authenticate.caCertFile.
> Thanks again. best regards.
>
> On Sat, Oct 15, 2022 at 4:51 PM Qian Sun  wrote:
>
>> Hi Mohammad
>> Did you try this command?
>>
>>
>> ./bin/spark-submit \
>>   --master k8s://https://vm13:6443 \
>>   --class com.example.WordCounter \
>>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
>>   --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
>>   --conf spark.kubernetes.namespace=default \
>>   java-word-count-1.0-SNAPSHOT.jar
>>
>> If you want spark.kubernetes.authenticate.caCertFile, you need to
>> configure it to serviceaccount certFile instead of apiserver certFile.
>>
>> On Sat, Oct 15, 2022 at 8:30 PM Mohammad Abdollahzade Arani
>> mamadazar...@gmail.com  wrote:
>>
>> I have a k8s cluster and a spark cluster.
>>>  my question is is as bellow:
>>>
>>>
>>> https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc
>>>
>>> I have searched and I found lot's of other similar questions on
>>> stackoverflow without an answer like  bellow:
>>>
>>>
>>> https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource
>>>
>>>
>>> --
>>> Best Wishes!
>>> Mohammad Abdollahzade Arani
>>> Computer Engineering @ SBU
>>>
>>> --
>> Best!
>> Qian Sun
>>
>
>
> --
> Best Wishes!
> Mohammad Abdollahzade Arani
> Computer Engineering @ SBU
>
>

-- 
Best!
Qian Sun


Re: spark on kubernetes

2022-10-15 Thread Qian Sun
Hi Mohammad
Did you try this command?


./bin/spark-submit \
  --master k8s://https://vm13:6443 \
  --class com.example.WordCounter \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
  --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
  --conf spark.kubernetes.namespace=default \
  java-word-count-1.0-SNAPSHOT.jar

If you want to use spark.kubernetes.authenticate.caCertFile, you need to point
it at the service account's CA cert file rather than the API server's cert file.
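
For anyone hitting the same authentication problem, here is only a rough sketch of the
fix described above and in the later reply. The namespace, service account, and output
paths are placeholders, and it assumes a cluster that still auto-creates a token Secret
for the service account (the data fields in a Secret are base64-encoded, so they must
be decoded before Spark can use them as files):

# Find the service account's token Secret and decode its CA cert and token.
SECRET=$(kubectl -n default get serviceaccount default -o jsonpath='{.secrets[0].name}')
kubectl -n default get secret "$SECRET" -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/sa-ca.crt
kubectl -n default get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d > /tmp/sa-token

# Then point the client-mode authentication properties at the decoded files.
./bin/spark-submit \
  --master k8s://https://vm13:6443 \
  --class com.example.WordCounter \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
  --conf spark.kubernetes.authenticate.caCertFile=/tmp/sa-ca.crt \
  --conf spark.kubernetes.authenticate.oauthTokenFile=/tmp/sa-token \
  --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
  java-word-count-1.0-SNAPSHOT.jar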

On Sat, Oct 15, 2022 at 8:30 PM Mohammad Abdollahzade Arani
mamadazar...@gmail.com  wrote:

I have a k8s cluster and a spark cluster.
>  my question is is as bellow:
>
>
> https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc
>
> I have searched and I found lot's of other similar questions on
> stackoverflow without an answer like  bellow:
>
>
> https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource
>
>
> --
> Best Wishes!
> Mohammad Abdollahzade Arani
> Computer Engineering @ SBU
>
> --
Best!
Qian Sun


spark on kubernetes

2022-10-15 Thread Mohammad Abdollahzade Arani
I have a k8s cluster and a Spark cluster.
My question is as below:

https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc

I have searched and found lots of other similar questions on
Stack Overflow without an answer, like the one below:

https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource


-- 
Best Wishes!
Mohammad Abdollahzade Arani
Computer Engineering @ SBU


trouble using spark in kubernetes

2022-05-03 Thread Andreas Klos

Hello everyone,

I am trying to run a minimal example in my k8s cluster.

First, I cloned the petastorm github repo: https://github.com/uber/petastorm

Second, I created a Docker image as follows:

FROM ubuntu:20.04
RUN apt-get update -qq
RUN apt-get install -qq -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get update -qq

RUN apt-get -qq install -y \
  build-essential \
  cmake \
  openjdk-8-jre-headless \
  git \
  python \
  python3-pip \
  python3.9 \
  python3.9-dev \
  python3.9-venv \
  virtualenv \
  wget \
  && rm -rf /var/lib/apt/lists/*
RUN wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2 -P /data/mnist/
RUN mkdir /petastorm
ADD setup.py /petastorm/
ADD README.rst /petastorm/
ADD petastorm /petastorm/petastorm
RUN python3.9 -m pip install pip --upgrade
RUN python3.9 -m pip install wheel
RUN python3.9 -m venv /petastorm_venv3.9
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache scikit-build
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache -e /petastorm/[test,tf,torch,docs,opencv] --only-binary pyarrow --only-binary opencv-python
RUN /petastorm_venv3.9/bin/pip3.9 install -U pyarrow==3.0.0 numpy==1.19.3 tensorflow==2.5.0 pyspark==3.0.0
RUN /petastorm_venv3.9/bin/pip3.9 install opencv-python-headless
RUN rm -r /petastorm
ADD docker/run_in_venv.sh /

Afterwards, I create a namespace called spark in my k8s cluster, a
ServiceAccount (spark-driver), and a RoleBinding for the service account,
as follows:


kubectl create ns spark
kubectl create serviceaccount spark-driver
kubectl create rolebinding spark-driver-rb --clusterrole=cluster-admin --serviceaccount=spark:spark-driver


Finally I create a pod in the spark namespace as follows:

apiVersion: v1
kind: Pod
metadata:
  name: "petastorm-ds-creator"
  namespace: spark
  labels:
    app: "petastorm-ds-creator"
spec:
  serviceAccount: spark-driver
  containers:
  - name: petastorm-ds-creator
    image: "imagename"
    command:
      - "/bin/bash"
      - "-c"
      - "--"
    args:
      - "while true; do sleep 30; done;"
    resources:
      limits:
        cpu: 2000m
        memory: 5000Mi
      requests:
        cpu: 2000m
        memory: 5000Mi
    ports:
    - containerPort: 80
      name: http
    - containerPort: 443
      name: https
    - containerPort: 20022
      name: exposed
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: spark-geodata-nfs-pvc-20220503
  restartPolicy: Always

I expose port 20022 of the pod with a headless service

kubectl expose pod petastorm-ds-creator --port=20022 --type=ClusterIP --cluster-ip=None -n spark


Finally, I run the following code in the created container/pod:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
spark_conf.setMaster("k8s://https://kubernetes.default:443;)
spark_conf.setAppName("PetastormDsCreator")
spark_conf.set(
    "spark.kubernetes.namespace",
    "spark"
)
spark_conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName",
    "spark-driver"
)
spark_conf.set(
    "spark.kubernetes.authenticate.caCertFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
)
spark_conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/token"
)
spark_conf.set(
    "spark.executor.instances",
    "2"
)
spark_conf.set(
    "spark.driver.host",
    "petastorm-ds-creator"
)
spark_conf.set(
    "spark.driver.port",
    "20022"
)
spark_conf.set(
    "spark.kubernetes.container.image",
    "imagename"
)
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)

Unfortunately, it does not work.

With kubectl describe po podname-exec-1 I get the following error message:

Error: failed to start container "spark-kubernetes-executor": Error 
response from daemon: OCI runtime create failed: container_linux.go:349: 
starting container process caused "exec: \"executor\": executable file 
not found in $PATH": unknown


Could somebody give me a hint as to what I am doing wrong? Is my SparkSession
configuration incorrect?


Best regards

Andreas
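
The "exec: \"executor\": executable file not found in $PATH" message usually means the
image given in spark.kubernetes.container.image is not a Spark-built image: executor
pods are started through Spark's own entrypoint script, which a plain Ubuntu/petastorm
image does not contain. Below is only a hedged sketch of one way to address it, using
the docker-image-tool.sh script that ships with a Spark binary distribution; the
registry, tag, and distribution path are placeholders:

# Build driver/executor images (including the PySpark variant) from an
# unpacked Spark distribution, so the image contains Spark's entrypoint.
cd /opt/spark-3.0.0-bin-hadoop3.2
./bin/docker-image-tool.sh \
  -r <your-registry>/spark \
  -t 3.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build
./bin/docker-image-tool.sh -r <your-registry>/spark -t 3.0.0 push

# In the driver code above, point the executors at that image, e.g.
#   spark_conf.set("spark.kubernetes.container.image",
#                  "<your-registry>/spark/spark-py:3.0.0")

Since the driver here runs inside the existing petastorm pod (client mode), only the
executor pods need a Spark-built image; spark.kubernetes.executor.container.image can
override the shared setting if the two images should differ.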


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
It uses Helm to deploy the Spark Operator and NGINX. For the other parts, like
creating the EKS cluster, IAM role, node group, etc., it uses the AWS SDK to
provision those AWS resources.
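
As a rough illustration of the Helm part only (the chart repository URL, release, and
namespace names below are assumptions based on the Spark Operator project's own
documentation, not details taken from this thread):

# Add the Spark Operator chart repository (see the operator's README for the
# current URL) and install the operator into its own namespace.
helm repo add spark-operator <spark-operator-chart-repo-url>
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace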

On Wed, Feb 23, 2022 at 11:28 AM Bjørn Jørgensen 
wrote:

> So if I get this right you will make a Helm <https://helm.sh> chart to
> deploy Spark and some other stuff on K8S?
>
On Wed, 23 Feb 2022 at 17:49, bo yang  wrote:
>
>> Hi Sarath, let's follow up offline on this.
>>
>> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy <
>> sarath.annare...@gmail.com> wrote:
>>
>>> Hi bo
>>>
>>> How do we start?
>>>
>>> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>>>
>>>
>>> Thanks
>>> Sarath
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>>>
>>> 
>>> Hi Sarath, thanks for your interest and willing to contribute! The
>>> project supports local development using MiniKube. Similarly there is a one
>>> click command with one extra argument to deploy all components in MiniKube,
>>> and people could use that to develop on their local MacBook.
>>>
>>>
>>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
>>> sarath.annare...@gmail.com> wrote:
>>>
>>>> Hi bo
>>>>
>>>> I am interested to contribute.
>>>> But I don’t have free access to any cloud provider. Not sure how I can
>>>> get free access. I know Google, aws, azure only provides temp free access,
>>>> it may not be sufficient.
>>>>
>>>> Guidance is appreciated.
>>>>
>>>> Sarath
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>>
>>>> 
>>>>
>>>> Right, normally people start with simple script, then add more stuff,
>>>> like permission and more components. After some time, people want to run
>>>> the script consistently in different environments. Things will become
>>>> complex.
>>>>
>>>> That is why we want to see whether people have interest for such a "one
>>>> click" tool to make things easy.
>>>>
>>>>
>>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> There are two distinct actions here; namely Deploy and Run.
>>>>>
>>>>> Deployment can be done by command line script with autoscaling. In the
>>>>> newer versions of Kubernnetes you don't even need to specify the node
>>>>> types, you can leave it to the Kubernetes cluster  to scale up and down 
>>>>> and
>>>>> decide on node type.
>>>>>
>>>>> The second point is the running spark that you will need to submit.
>>>>> However, that depends on setting up access permission, use of service
>>>>> accounts, pulling the correct dockerfiles for the driver and the 
>>>>> executors.
>>>>> Those details add to the complexity.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>>
>>>>>> Hi Spark Community,
>>>>>>
>>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>>> with a one click command. For example, on AWS, it could automatically
>>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. 
>>>>>> Then
>>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>>> to
>>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>>
>>>>>> Anyone interested in using or working together on such a tool?
>>>>>>
>>>>>> Thanks,
>>>>>> Bo
>>>>>>
>>>>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bjørn Jørgensen
So if I get this right you will make a Helm <https://helm.sh> chart to
deploy Spark and some other stuff on K8S?

On Wed, 23 Feb 2022 at 17:49, bo yang  wrote:

> Hi Sarath, let's follow up offline on this.
>
> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy <
> sarath.annare...@gmail.com> wrote:
>
>> Hi bo
>>
>> How do we start?
>>
>> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>>
>>
>> Thanks
>> Sarath
>>
>>
>> Sent from my iPhone
>>
>> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>>
>> 
>> Hi Sarath, thanks for your interest and willing to contribute! The
>> project supports local development using MiniKube. Similarly there is a one
>> click command with one extra argument to deploy all components in MiniKube,
>> and people could use that to develop on their local MacBook.
>>
>>
>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
>> sarath.annare...@gmail.com> wrote:
>>
>>> Hi bo
>>>
>>> I am interested to contribute.
>>> But I don’t have free access to any cloud provider. Not sure how I can
>>> get free access. I know Google, aws, azure only provides temp free access,
>>> it may not be sufficient.
>>>
>>> Guidance is appreciated.
>>>
>>> Sarath
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>
>>> 
>>>
>>> Right, normally people start with simple script, then add more stuff,
>>> like permission and more components. After some time, people want to run
>>> the script consistently in different environments. Things will become
>>> complex.
>>>
>>> That is why we want to see whether people have interest for such a "one
>>> click" tool to make things easy.
>>>
>>>
>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> There are two distinct actions here; namely Deploy and Run.
>>>>
>>>> Deployment can be done by command line script with autoscaling. In the
>>>> newer versions of Kubernnetes you don't even need to specify the node
>>>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>>>> decide on node type.
>>>>
>>>> The second point is the running spark that you will need to submit.
>>>> However, that depends on setting up access permission, use of service
>>>> accounts, pulling the correct dockerfiles for the driver and the executors.
>>>> Those details add to the complexity.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>
>>>>> Hi Spark Community,
>>>>>
>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>> with a one click command. For example, on AWS, it could automatically
>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then
>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>> to
>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>
>>>>> Anyone interested in using or working together on such a tool?
>>>>>
>>>>> Thanks,
>>>>> Bo
>>>>>
>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, let's follow up offline on this.

On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy 
wrote:

> Hi bo
>
> How do we start?
>
> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>
>
> Thanks
> Sarath
>
>
> Sent from my iPhone
>
> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>
> 
> Hi Sarath, thanks for your interest and willing to contribute! The project
> supports local development using MiniKube. Similarly there is a one click
> command with one extra argument to deploy all components in MiniKube, and
> people could use that to develop on their local MacBook.
>
>
> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
> sarath.annare...@gmail.com> wrote:
>
>> Hi bo
>>
>> I am interested to contribute.
>> But I don’t have free access to any cloud provider. Not sure how I can
>> get free access. I know Google, aws, azure only provides temp free access,
>> it may not be sufficient.
>>
>> Guidance is appreciated.
>>
>> Sarath
>>
>> Sent from my iPhone
>>
>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>
>> 
>>
>> Right, normally people start with simple script, then add more stuff,
>> like permission and more components. After some time, people want to run
>> the script consistently in different environments. Things will become
>> complex.
>>
>> That is why we want to see whether people have interest for such a "one
>> click" tool to make things easy.
>>
>>
>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> There are two distinct actions here; namely Deploy and Run.
>>>
>>> Deployment can be done by command line script with autoscaling. In the
>>> newer versions of Kubernnetes you don't even need to specify the node
>>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>>> decide on node type.
>>>
>>> The second point is the running spark that you will need to submit.
>>> However, that depends on setting up access permission, use of service
>>> accounts, pulling the correct dockerfiles for the driver and the executors.
>>> Those details add to the complexity.
>>>
>>> Thanks
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>
>>>> Hi Spark Community,
>>>>
>>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>>> a one click command. For example, on AWS, it could automatically create an
>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>>> be able to use curl or a CLI tool to submit Spark application. After the
>>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>>> Dynamic Allocation on Kuberentes.
>>>>
>>>> Anyone interested in using or working together on such a tool?
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Sarath Annareddy
Hi bo

How do we start?

Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc


Thanks 
Sarath 


Sent from my iPhone

> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
> 
> 
> Hi Sarath, thanks for your interest and willing to contribute! The project 
> supports local development using MiniKube. Similarly there is a one click 
> command with one extra argument to deploy all components in MiniKube, and 
> people could use that to develop on their local MacBook.
> 
> 
>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy 
>>  wrote:
>> Hi bo
>> 
>> I am interested to contribute. 
>> But I don’t have free access to any cloud provider. Not sure how I can get 
>> free access. I know Google, aws, azure only provides temp free access, it 
>> may not be sufficient.
>> 
>> Guidance is appreciated.
>> 
>> Sarath 
>> 
>> Sent from my iPhone
>> 
>>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>> 
>>> 
>> 
>>> Right, normally people start with simple script, then add more stuff, like 
>>> permission and more components. After some time, people want to run the 
>>> script consistently in different environments. Things will become complex.
>>> 
>>> That is why we want to see whether people have interest for such a "one 
>>> click" tool to make things easy.
>>> 
>>> 
>>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh 
>>>>  wrote:
>>>> Hi,
>>>> 
>>>> There are two distinct actions here; namely Deploy and Run.
>>>> 
>>>> Deployment can be done by command line script with autoscaling. In the 
>>>> newer versions of Kubernnetes you don't even need to specify the node 
>>>> types, you can leave it to the Kubernetes cluster  to scale up and down 
>>>> and decide on node type.
>>>> 
>>>> The second point is the running spark that you will need to submit. 
>>>> However, that depends on setting up access permission, use of service 
>>>> accounts, pulling the correct dockerfiles for the driver and the 
>>>> executors. Those details add to the complexity.
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>>view my Linkedin profile
>>>> 
>>>> 
>>>> 
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> 
>>>>  
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>> loss, damage or destruction of data or any other property which may arise 
>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>> The author will in no case be liable for any monetary damages arising from 
>>>> such loss, damage or destruction.
>>>>  
>>>> 
>>>> 
>>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>> Hi Spark Community,
>>>>> 
>>>>> We built an open source tool to deploy and run Spark on Kubernetes with a 
>>>>> one click command. For example, on AWS, it could automatically create an 
>>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will 
>>>>> be able to use curl or a CLI tool to submit Spark application. After the 
>>>>> deployment, you could also install Uber Remote Shuffle Service to enable 
>>>>> Dynamic Allocation on Kuberentes.
>>>>> 
>>>>> Anyone interested in using or working together on such a tool?
>>>>> 
>>>>> Thanks,
>>>>> Bo
>>>>> 


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, thanks for your interest and willingness to contribute! The project
supports local development using MiniKube. Similarly there is a one click
command with one extra argument to deploy all components in MiniKube, and
people could use that to develop on their local MacBook.


On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy 
wrote:

> Hi bo
>
> I am interested to contribute.
> But I don’t have free access to any cloud provider. Not sure how I can get
> free access. I know Google, aws, azure only provides temp free access, it
> may not be sufficient.
>
> Guidance is appreciated.
>
> Sarath
>
> Sent from my iPhone
>
> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>
> 
>
> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernnetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Sarath Annareddy
Hi bo

I am interested in contributing.
But I don't have free access to any cloud provider, and I'm not sure how I can get
free access. I know Google, AWS, and Azure only provide temporary free access, which
may not be sufficient.

Guidance is appreciated.

Sarath 

Sent from my iPhone

> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
> 
> 
> Right, normally people start with simple script, then add more stuff, like 
> permission and more components. After some time, people want to run the 
> script consistently in different environments. Things will become complex.
> 
> That is why we want to see whether people have interest for such a "one 
> click" tool to make things easy.
> 
> 
>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh  
>> wrote:
>> Hi,
>> 
>> There are two distinct actions here; namely Deploy and Run.
>> 
>> Deployment can be done by command line script with autoscaling. In the newer 
>> versions of Kubernnetes you don't even need to specify the node types, you 
>> can leave it to the Kubernetes cluster  to scale up and down and decide on 
>> node type.
>> 
>> The second point is the running spark that you will need to submit. However, 
>> that depends on setting up access permission, use of service accounts, 
>> pulling the correct dockerfiles for the driver and the executors. Those 
>> details add to the complexity.
>> 
>> Thanks
>> 
>> 
>>view my Linkedin profile
>> 
>> 
>> 
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>  
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>> Hi Spark Community,
>>> 
>>> We built an open source tool to deploy and run Spark on Kubernetes with a 
>>> one click command. For example, on AWS, it could automatically create an 
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will 
>>> be able to use curl or a CLI tool to submit Spark application. After the 
>>> deployment, you could also install Uber Remote Shuffle Service to enable 
>>> Dynamic Allocation on Kuberentes.
>>> 
>>> Anyone interested in using or working together on such a tool?
>>> 
>>> Thanks,
>>> Bo
>>> 


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bitfox
From my viewpoint, if there is such a pay-as-you-go service I would like to
use it.
Otherwise I have to deploy a regular Spark cluster on GCP/AWS etc., and the
cost is not low.

Thanks.

On Wed, Feb 23, 2022 at 4:00 PM bo yang  wrote:

> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernnetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Right, normally people start with a simple script, then add more stuff, like
permissions and more components. After some time, people want to run the
script consistently in different environments. Things become complex.

That is why we want to see whether people have interest in such a "one
click" tool to make things easy.


On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh 
wrote:

> Hi,
>
> There are two distinct actions here; namely Deploy and Run.
>
> Deployment can be done by command line script with autoscaling. In the
> newer versions of Kubernnetes you don't even need to specify the node
> types, you can leave it to the Kubernetes cluster  to scale up and down and
> decide on node type.
>
> The second point is the running spark that you will need to submit.
> However, that depends on setting up access permission, use of service
> accounts, pulling the correct dockerfiles for the driver and the executors.
> Those details add to the complexity.
>
> Thanks
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>
>> Hi Spark Community,
>>
>> We built an open source tool to deploy and run Spark on Kubernetes with a
>> one click command. For example, on AWS, it could automatically create an
>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>> be able to use curl or a CLI tool to submit Spark application. After the
>> deployment, you could also install Uber Remote Shuffle Service to enable
>> Dynamic Allocation on Kuberentes.
>>
>> Anyone interested in using or working together on such a tool?
>>
>> Thanks,
>> Bo
>>
>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Mich Talebzadeh
Hi,

There are two distinct actions here; namely Deploy and Run.

Deployment can be done by a command line script with autoscaling. In the
newer versions of Kubernetes you don't even need to specify the node
types; you can leave it to the Kubernetes cluster to scale up and down and
decide on the node type.

The second point is running the Spark application that you will need to
submit. However, that depends on setting up access permissions, the use of
service accounts, and pulling the correct Docker images for the driver and
the executors. Those details add to the complexity.

Thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:

> Hi Spark Community,
>
> We built an open source tool to deploy and run Spark on Kubernetes with a
> one click command. For example, on AWS, it could automatically create an
> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
> be able to use curl or a CLI tool to submit Spark application. After the
> deployment, you could also install Uber Remote Shuffle Service to enable
> Dynamic Allocation on Kuberentes.
>
> Anyone interested in using or working together on such a tool?
>
> Thanks,
> Bo
>
>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Merging in another email from Prasad. It could co-exist with Livy. Livy is
similar to the REST Service + Spark Operator combination. Unfortunately,
Livy is not very active right now.

To Amihay, the link is: https://github.com/datapunchorg/punch.

On Tue, Feb 22, 2022 at 8:53 PM amihay gonen  wrote:

> Can you share link to the source?
>
On Wed, 23 Feb 2022 at 6:52, bo yang wrote:
>
>> We do not have SaaS yet. Now it is an open source project we build in our
>> part time , and we welcome more people working together on that.
>>
>> You could specify cluster size (EC2 instance type and number of
>> instances) and run it for 1 hour. Then you could run one click command to
>> destroy the cluster. It is possible to merge these steps as well, and
>> provide a "serverless" experience. That is in our TODO list :)
>>
>>
>> On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:
>>
>>> How can I specify the cluster memory and cores?
>>> For instance, I want to run a job with 16 cores and 300 GB memory for
>>> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>>>
>>> Thanks
>>>
>>> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>>>
>>>> It is not a standalone spark cluster. In some details, it deploys a
>>>> Spark Operator (
>>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and an
>>>> extra REST Service. When people submit Spark application to that REST
>>>> Service, the REST Service will create a CRD inside the Kubernetes cluster.
>>>> Then Spark Operator will pick up the CRD and launch the Spark application.
>>>> The one click tool intends to hide these details, so people could just
>>>> submit Spark and do not need to deal with too many deployment details.
>>>>
>>>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>>>
>>>>> Can it be a cluster installation of spark? or just the standalone node?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>>>
>>>>>> Hi Spark Community,
>>>>>>
>>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>>> with a one click command. For example, on AWS, it could automatically
>>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. 
>>>>>> Then
>>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>>> to
>>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>>
>>>>>> Anyone interested in using or working together on such a tool?
>>>>>>
>>>>>> Thanks,
>>>>>> Bo
>>>>>>
>>>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread amihay gonen
Can you share link to the source?

On Wed, 23 Feb 2022 at 6:52, bo yang wrote:

> We do not have SaaS yet. Now it is an open source project we build in our
> part time , and we welcome more people working together on that.
>
> You could specify cluster size (EC2 instance type and number of instances)
> and run it for 1 hour. Then you could run one click command to destroy the
> cluster. It is possible to merge these steps as well, and provide a
> "serverless" experience. That is in our TODO list :)
>
>
> On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:
>
>> How can I specify the cluster memory and cores?
>> For instance, I want to run a job with 16 cores and 300 GB memory for
>> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>>
>>> It is not a standalone spark cluster. In some details, it deploys a
>>> Spark Operator (
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and an
>>> extra REST Service. When people submit Spark application to that REST
>>> Service, the REST Service will create a CRD inside the Kubernetes cluster.
>>> Then Spark Operator will pick up the CRD and launch the Spark application.
>>> The one click tool intends to hide these details, so people could just
>>> submit Spark and do not need to deal with too many deployment details.
>>>
>>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>>
>>>> Can it be a cluster installation of spark? or just the standalone node?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>>
>>>>> Hi Spark Community,
>>>>>
>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>> with a one click command. For example, on AWS, it could automatically
>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then
>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>> to
>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>
>>>>> Anyone interested in using or working together on such a tool?
>>>>>
>>>>> Thanks,
>>>>> Bo
>>>>>
>>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
We do not have a SaaS offering yet. Right now it is an open source project we
build in our spare time, and we welcome more people working together on it.

You could specify cluster size (EC2 instance type and number of instances)
and run it for 1 hour. Then you could run one click command to destroy the
cluster. It is possible to merge these steps as well, and provide a
"serverless" experience. That is in our TODO list :)


On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:

> How can I specify the cluster memory and cores?
> For instance, I want to run a job with 16 cores and 300 GB memory for
> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>
> Thanks
>
> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>
>> It is not a standalone spark cluster. In some details, it deploys a Spark
>> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
>> and an extra REST Service. When people submit Spark application to that
>> REST Service, the REST Service will create a CRD inside the
>> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
>> Spark application. The one click tool intends to hide these details, so
>> people could just submit Spark and do not need to deal with too many
>> deployment details.
>>
>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>
>>> Can it be a cluster installation of spark? or just the standalone node?
>>>
>>> Thanks
>>>
>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>
>>>> Hi Spark Community,
>>>>
>>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>>> a one click command. For example, on AWS, it could automatically create an
>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>>> be able to use curl or a CLI tool to submit Spark application. After the
>>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>>> Dynamic Allocation on Kuberentes.
>>>>
>>>> Anyone interested in using or working together on such a tool?
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Prasad Paravatha
Hi Bo Yang,
Would it be something along the lines of Apache Livy?

Thanks,
Prasad


On Tue, Feb 22, 2022 at 10:22 PM bo yang  wrote:

> It is not a standalone spark cluster. In some details, it deploys a Spark
> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
> and an extra REST Service. When people submit Spark application to that
> REST Service, the REST Service will create a CRD inside the
> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
> Spark application. The one click tool intends to hide these details, so
> people could just submit Spark and do not need to deal with too many
> deployment details.
>
> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>
>> Can it be a cluster installation of spark? or just the standalone node?
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>

-- 
Regards,
Prasad Paravatha


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
How can I specify the cluster memory and cores?
For instance, I want to run a job with 16 cores and 300 GB memory for about
1 hour. Do you have the SaaS solution for this? I can pay as I did.

Thanks

On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:

> It is not a standalone spark cluster. In some details, it deploys a Spark
> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
> and an extra REST Service. When people submit Spark application to that
> REST Service, the REST Service will create a CRD inside the
> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
> Spark application. The one click tool intends to hide these details, so
> people could just submit Spark and do not need to deal with too many
> deployment details.
>
> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>
>> Can it be a cluster installation of spark? or just the standalone node?
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
It is not a standalone Spark cluster. In more detail, it deploys a Spark
Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and
an extra REST Service. When people submit a Spark application to that REST
Service, the REST Service creates a custom resource inside the Kubernetes
cluster. The Spark Operator then picks up that resource and launches the
Spark application. The one click tool intends to hide these details, so
people can just submit Spark jobs and do not need to deal with too many
deployment details.
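
To make that flow concrete, below is only a hedged sketch of the kind of custom
resource the operator consumes; field values such as the image, jar path, versions,
and service account are placeholders taken from the operator's public examples, not
from this tool:

# A SparkApplication custom resource; the Spark Operator watches for these
# objects and translates them into driver/executor pods.
kubectl apply -f - <<'EOF'
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "<your-spark-image>"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"
EOF

In this setup the REST Service is what writes such an object on the user's behalf, so
the user only sends a plain submit request.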

On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:

> Can it be a cluster installation of spark? or just the standalone node?
>
> Thanks
>
> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>
>> Hi Spark Community,
>>
>> We built an open source tool to deploy and run Spark on Kubernetes with a
>> one click command. For example, on AWS, it could automatically create an
>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>> be able to use curl or a CLI tool to submit Spark application. After the
>> deployment, you could also install Uber Remote Shuffle Service to enable
>> Dynamic Allocation on Kuberentes.
>>
>> Anyone interested in using or working together on such a tool?
>>
>> Thanks,
>> Bo
>>
>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
Can it be a cluster installation of spark? or just the standalone node?

Thanks

On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:

> Hi Spark Community,
>
> We built an open source tool to deploy and run Spark on Kubernetes with a
> one click command. For example, on AWS, it could automatically create an
> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
> be able to use curl or a CLI tool to submit Spark application. After the
> deployment, you could also install Uber Remote Shuffle Service to enable
> Dynamic Allocation on Kuberentes.
>
> Anyone interested in using or working together on such a tool?
>
> Thanks,
> Bo
>
>


One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Hi Spark Community,

We built an open source tool to deploy and run Spark on Kubernetes with a
one click command. For example, on AWS, it could automatically create an
EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
be able to use curl or a CLI tool to submit a Spark application. After the
deployment, you could also install the Uber Remote Shuffle Service to enable
Dynamic Allocation on Kubernetes.

Anyone interested in using or working together on such a tool?

Thanks,
Bo


Shuffle in Spark with Kubernetes

2021-10-27 Thread Mich Talebzadeh
As I understand it, Spark releases > 3 currently do not support an external
shuffle service on Kubernetes. Are there any timelines for when this could be
available?

For now we have two parameters for Dynamic Resource Allocation. These are

 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.shuffleTracking.enabled=true \


The idea is to use dynamic resource allocation, where the driver tracks the
shuffle files and evicts only executors not storing active shuffle files.
So, in a nutshell, these shuffle files are stored in the executors themselves
in the absence of an external shuffle service. The model works on the basis
of the "one-container-per-pod" model, meaning that one node of the cluster
runs the driver and each remaining node runs one executor. If I
over-provision my GKE cluster, for example by adding one redundant node and
increasing the number of executors by one, it should improve the latency.
Have there been any benchmarks on this feature?

Thanks



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Mich Talebzadeh
Splendid.

Please invite me to the next meeting

mich.talebza...@gmail.com

Timezone London, UK  *GMT+1*

Thanks,


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 8 Jul 2021 at 19:04, Holden Karau  wrote:

> Hi Y'all,
>
> We had an initial meeting which went well, got some more context around
> Volcano and its near-term roadmap. Talked about the impact around scheduler
> deadlocking and some ways that we could potentially improve integration
> from the Spark side and Volcano sides respectively. I'm going to start
> creating some sub-issues under
> https://issues.apache.org/jira/browse/SPARK-36057
>
> If anyone is interested in being on the next meeting please reach out and
> I'll send an e-mail around to try and schedule re-occurring sync that works
> for folks.
>
> Cheers,
>
> Holden
>
> On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:
>
>> That's awesome, I'm just starting to get context around Volcano but maybe
>> we can schedule an initial meeting for all of us interested in pursuing
>> this to get on the same page.
>>
>> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>>
>>> Hi team,
>>>
>>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>>> community also has such requirements :)
>>>
>>> Volcano provides several features for batch workload, e.g. fair-share,
>>> queue, reservation, preemption/reclaim and so on.
>>> It has been used in several product environments with Spark; if
>>> necessary, I can give an overall introduction about Volcano's features and
>>> those use cases :)
>>>
>>> -- Klaus
>>>
>>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> Please allow me to be diverse and express a different point of view on
>>>> this roadmap.
>>>>
>>>>
>>>> I believe from a technical point of view spending time and effort plus
>>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>>> may say I doubt whether such an approach and the so-called democratization
>>>> of Spark on whatever platform is really should be of great focus.
>>>>
>>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A 
>>>> fully
>>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>>> more recently other artefacts) for that past two years, and Spark on
>>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>>> beast that that one can fully commoditize it much like one can do with
>>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>>> effortlessly on these commercial platforms with whatever as a Service.
>>>>
>>>>
>>>> Moreover, Spark (and I stand corrected) from the ground up has already
>>>> a lot of resiliency and redundancy built in. It is truly an enterprise
>>>> class product (requires enterprise class support) that will be difficult to
>>>> commoditize with Kubernetes and expect the same performance. After all,
>>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>>> for the mass market. In short I can see commercial enterprises will work on
>>>> these platforms ,but may be the great talents on dev team should focus on
>>>> stuff like the perceived limitation of SSS in dealing with chain of
>>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>>
>>>>
>>>> These are my opinions and they are not facts, just opinions so to speak
>>>> :)
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying 

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
Hi Y'all,

We had an initial meeting which went well, got some more context around
Volcano and its near-term roadmap. Talked about the impact around scheduler
deadlocking and some ways that we could potentially improve integration
from the Spark side and Volcano sides respectively. I'm going to start
creating some sub-issues under
https://issues.apache.org/jira/browse/SPARK-36057

If anyone is interested in being on the next meeting please reach out and
I'll send an e-mail around to try and schedule re-occurring sync that works
for folks.

Cheers,

Holden
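
For anyone experimenting before deeper integration lands, one low-touch way to hand
the driver and executor pods to a batch scheduler such as Volcano is through pod
template files (available since Spark 3.0). This is only a sketch under the assumption
that the scheduler is already installed in the cluster; it does not by itself provide
gang or queue semantics, which is exactly the integration work discussed above. File
names, the scheduler name, image, and API server address are placeholders:

# Pod template that asks Kubernetes to use a non-default scheduler.
cat > /tmp/batch-scheduler-pod-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  schedulerName: volcano
EOF

./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.driver.podTemplateFile=/tmp/batch-scheduler-pod-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=/tmp/batch-scheduler-pod-template.yaml \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar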

On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:

> That's awesome, I'm just starting to get context around Volcano but maybe
> we can schedule an initial meeting for all of us interested in pursuing
> this to get on the same page.
>
> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>
>> Hi team,
>>
>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>> community also has such requirements :)
>>
>> Volcano provides several features for batch workload, e.g. fair-share,
>> queue, reservation, preemption/reclaim and so on.
>> It has been used in several product environments with Spark; if
>> necessary, I can give an overall introduction about Volcano's features and
>> those use cases :)
>>
>> -- Klaus
>>
>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say I doubt whether such an approach and the so-called democratization
>>> of Spark on whatever platform is really should be of great focus.
>>>
>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A 
>>> fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>> more recently other artefacts) for that past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that that one can fully commoditize it much like one can do with
>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>> effortlessly on these commercial platforms with whatever as a Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) from the ground up has already a
>>> lot of resiliency and redundancy built in. It is truly an enterprise class
>>> product (requires enterprise class support) that will be difficult to
>>> commoditize with Kubernetes and expect the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>> for the mass market. In short I can see commercial enterprises will work on
>>> these platforms ,but may be the great talents on dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chain of
>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
>>>> I think these approaches are good, but there are limitations (eg
>>>> dynamic scaling) without us making changes inside of the Spark Kube
>>>> scheduler.
>>>>
>>>> Certainly whichever scheduler extensions we add support for we should
>>>> collaborate with the people developing those extensions insofar as they are
>>>> interested. My first place that I checked was #sig-scheduling which is
>>>> fairly quite on the Kubernetes slack but if there are more places to look
>>>> for folks interested in batch scheduling

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Hi Holden,

Thank you for your points. I guess, coming from a corporate world, I had
overlooked how an open source project like Spark leverages resources and
interest :).

As @KlausMa kindly volunteered, it would be good to hear scheduling ideas for
Spark on Kubernetes, and of course, since I am sure you have some
inroads/ideas on this subject as well, truly I guess love would be in the air
for Kubernetes <https://www.youtube.com/watch?v=NNC0kIzM1Fo>

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 16:59, Holden Karau  wrote:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people I don't hear about new standalone, YARN or mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but its important to remember that the Spark developers are not
> all fully interchangeable, we work on the things that we're interested in
> pursuing so even if structured streaming needs more love if I'm not super
> interested in structured streaming I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> in

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome, I'm just starting to get context around Volcano but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>>> https://github.com/volcano-sh/volcano
>>>>
>>> <https://github.com/volcano-sh/volcano>
>>>>
>>>> There may be value-add in collaborating with such groups through CNCF
>>>> in order to have a c

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Lalwani, Jayesh
You can always chain aggregations by chaining multiple Structured Streaming 
jobs. It’s not a showstopper.

Getting Spark on Kubernetes is important for organizations that want to pursue 
a multi-cloud strategy
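
For illustration, a minimal Scala sketch of what chaining aggregations across multiple Structured Streaming queries can look like: the first query writes its aggregation to an intermediate Kafka topic and a second query aggregates that output again. The broker address, topic names and checkpoint paths are assumptions, and the spark-sql-kafka connector is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("chained-agg-sketch").getOrCreate()
import spark.implicits._

// Query 1: aggregate raw events and publish the result to an intermediate topic.
val firstAgg = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")   // assumed broker
  .option("subscribe", "events")                     // assumed input topic
  .load()
  .selectExpr("CAST(value AS STRING) AS k")
  .groupBy($"k").count()
  .selectExpr("k AS key", "CAST(`count` AS STRING) AS value")

firstAgg.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("topic", "events-agg")                     // assumed intermediate topic
  .option("checkpointLocation", "/tmp/chk-agg1")
  .outputMode("update")
  .start()

// Query 2: read the first aggregation back and aggregate it again.
val secondAgg = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events-agg")
  .load()
  .selectExpr("CAST(key AS STRING) AS k",
              "CAST(CAST(value AS STRING) AS LONG) AS cnt")
  .groupBy($"k").agg(sum($"cnt").as("running_total"))

secondAgg.writeStream.format("console")
  .option("checkpointLocation", "/tmp/chk-agg2")
  .outputMode("complete")
  .start()

spark.streams.awaitAnyTermination()

The price of this pattern is the extra hop through the intermediate topic and a second set of checkpoints, but it does work around the single-aggregation limit within one streaming query.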

From: Mich Talebzadeh 
Date: Wednesday, June 23, 2021 at 11:27 AM
To: "user @spark" 
Cc: dev 
Subject: RE: [EXTERNAL] Spark on Kubernetes scheduler variety






Please allow me to be diverse and express a different point of view on this 
roadmap.

I believe from a technical point of view spending time and effort plus talent 
on batch scheduling on Kubernetes could be rewarding. However, if I may say I 
doubt whether such an approach and the so-called democratization of Spark on 
whatever platform is really should be of great focus.
Having worked on Google Dataproc<https://cloud.google.com/dataproc> (A fully 
managed and highly scalable service for running Apache Spark, Hadoop and more 
recently other artefacts) for that past two years, and Spark on Kubernetes 
on-premise, I have come to the conclusion that Spark is not a beast that that 
one can fully commoditize it much like one can do with  Zookeeper, Kafka etc. 
There is always a struggle to make some niche areas of Spark like Spark 
Structured Streaming (SSS) work seamlessly and effortlessly on these commercial 
platforms with whatever as a Service.

Moreover, Spark (and I stand corrected) from the ground up has already a lot of 
resiliency and redundancy built in. It is truly an enterprise class product 
(requires enterprise class support) that will be difficult to commoditize with 
Kubernetes and expect the same performance. After all, Kubernetes is aimed at 
efficient resource sharing and potential cost saving for the mass market. In 
short I can see commercial enterprises will work on these platforms ,but may be 
the great talents on dev team should focus on stuff like the perceived 
limitation of SSS in dealing with chain of aggregation( if I am correct it is 
not yet supported on streaming datasets)

These are my opinions and they are not facts, just opinions so to speak :)

   view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Jun 2021 at 23:18, Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I think these approaches are good, but there are limitations (eg dynamic 
scaling) without us making changes inside of the Spark Kube scheduler.

Certainly whichever scheduler extensions we add support for we should 
collaborate with the people developing those extensions insofar as they are 
interested. My first place that I checked was #sig-scheduling which is fairly 
quiet on the Kubernetes Slack but if there are more places to look for folks 
interested in batch scheduling on Kubernetes we should definitely give it a 
shot :)

On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

Regarding your point and I quote

"..  I know that one of the Spark on Kube operators supports volcano/kube-batch 
so I was thinking that might be a place I would start exploring..."

There seems to be ongoing work on say Volcano as part of  Cloud Native 
Computing Foundation<https://cncf.io/> (CNCF). For example through 
https://github.com/volcano-sh/volcano

There may be value-add in collaborating with such groups through CNCF in order 
to have a collective approach to such work. There also seems to be some work on 
Integration of Spark with Volcano for Batch 
Scheduling.<https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>



What is not very clear is the degree of progress of these projects. You may be 
kind enough to elaborate on KPI for each of these projects and where you think 
your contributions is going to be.



HTH,



Mich


   view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Jun 2021 at 00:44, Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
Hi Folks,

I'm continuing my adventures to make Spark on containers party a

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Klaus! I am interested in more details.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>>> https://github.com/volcano-sh/volcano
>>>>
>>> <https://github.com/volcano-sh/volcano>
>>>>
>>>> There may be value-add in collaborating with such groups through CNCF
>>>> in order to have a collective approach to such work. There also seems to be
>>>> some work on Integration of Spark with Volcano for Ba

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great.

It would also be helpful if you could elaborate on the need for these
features in light of the limitations of current batch workload scheduling.

Regards,

Mich



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 02:53, Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>&

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Klaus Ma
Hi team,

I'm the kube-batch/Volcano founder, and I'm excited to hear that the Spark
community also has such requirements :)

Volcano provides several features for batch workloads, e.g. fair-share,
queues, reservation, preemption/reclaim and so on.
It has been used in several production environments with Spark; if necessary,
I can give an overall introduction to Volcano's features and those use
cases :)

-- Klaus

On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh 
wrote:

>
>
> Please allow me to be diverse and express a different point of view on
> this roadmap.
>
>
> I believe from a technical point of view spending time and effort plus
> talent on batch scheduling on Kubernetes could be rewarding. However, if I
> may say I doubt whether such an approach and the so-called democratization
> of Spark on whatever platform is really should be of great focus.
>
> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
> managed and highly scalable service for running Apache Spark, Hadoop and
> more recently other artefacts) for that past two years, and Spark on
> Kubernetes on-premise, I have come to the conclusion that Spark is not a
> beast that that one can fully commoditize it much like one can do with
> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
> of Spark like Spark Structured Streaming (SSS) work seamlessly and
> effortlessly on these commercial platforms with whatever as a Service.
>
>
> Moreover, Spark (and I stand corrected) from the ground up has already a
> lot of resiliency and redundancy built in. It is truly an enterprise class
> product (requires enterprise class support) that will be difficult to
> commoditize with Kubernetes and expect the same performance. After all,
> Kubernetes is aimed at efficient resource sharing and potential cost saving
> for the mass market. In short I can see commercial enterprises will work on
> these platforms ,but may be the great talents on dev team should focus on
> stuff like the perceived limitation of SSS in dealing with chain of
> aggregation( if I am correct it is not yet supported on streaming datasets)
>
>
> These are my opinions and they are not facts, just opinions so to speak :)
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>
>> I think these approaches are good, but there are limitations (eg dynamic
>> scaling) without us making changes inside of the Spark Kube scheduler.
>>
>> Certainly whichever scheduler extensions we add support for we should
>> collaborate with the people developing those extensions insofar as they are
>> interested. My first place that I checked was #sig-scheduling which is
>> fairly quite on the Kubernetes slack but if there are more places to look
>> for folks interested in batch scheduling on Kubernetes we should definitely
>> give it a shot :)
>>
>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Regarding your point and I quote
>>>
>>> "..  I know that one of the Spark on Kube operators
>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>> start exploring..."
>>>
>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>> https://github.com/volcano-sh/volcano
>>>
>> <https://github.com/volcano-sh/volcano>
>>>
>>> There may be value-add in collaborating with such groups through CNCF in
>>> order to have a collective approach to such work. There also seems to be
>>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>>
>>>
>>>
>>> What is not very clear is the degree of progress of these projects. You
>>> may be kind enough to elaborate on KPI for each of these projects and where
>>> you think your contributions is going to be.
>>>
>>>
>>> HTH,
>>>
>>>
>>> Mich
>>>
>>>
>>&

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Mich Talebzadeh
Please allow me to be diverse and express a different point of view on
this roadmap.


I believe from a technical point of view that spending time, effort and
talent on batch scheduling on Kubernetes could be rewarding. However, if I
may say so, I doubt whether such an approach and the so-called democratization
of Spark on whatever platform really should be of great focus.

Having worked on Google Dataproc <https://cloud.google.com/dataproc> (a fully
managed and highly scalable service for running Apache Spark, Hadoop and
more recently other artefacts) for the past two years, and on Spark on
Kubernetes on-premise, I have come to the conclusion that Spark is not a
beast that one can fully commoditize much like one can do with
Zookeeper, Kafka etc. There is always a struggle to make some niche areas
of Spark, like Spark Structured Streaming (SSS), work seamlessly and
effortlessly on these commercial whatever-as-a-Service platforms.


Moreover, Spark (and I stand to be corrected) from the ground up already has a
lot of resiliency and redundancy built in. It is truly an enterprise class
product (requiring enterprise class support) that will be difficult to
commoditize with Kubernetes and expect the same performance. After all,
Kubernetes is aimed at efficient resource sharing and potential cost saving
for the mass market. In short, I can see commercial enterprises working on
these platforms, but maybe the great talents on the dev team should focus on
stuff like the perceived limitation of SSS in dealing with chains of
aggregation (if I am correct, it is not yet supported on streaming datasets).


These are my opinions and they are not facts, just opinions so to speak :)


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:

> I think these approaches are good, but there are limitations (eg dynamic
> scaling) without us making changes inside of the Spark Kube scheduler.
>
> Certainly whichever scheduler extensions we add support for we should
> collaborate with the people developing those extensions insofar as they are
> interested. My first place that I checked was #sig-scheduling which is
> fairly quite on the Kubernetes slack but if there are more places to look
> for folks interested in batch scheduling on Kubernetes we should definitely
> give it a shot :)
>
> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Regarding your point and I quote
>>
>> "..  I know that one of the Spark on Kube operators
>> supports volcano/kube-batch so I was thinking that might be a place I would
>> start exploring..."
>>
>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>> https://github.com/volcano-sh/volcano
>>
> <https://github.com/volcano-sh/volcano>
>>
>> There may be value-add in collaborating with such groups through CNCF in
>> order to have a collective approach to such work. There also seems to be
>> some work on Integration of Spark with Volcano for Batch Scheduling.
>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>
>>
>>
>> What is not very clear is the degree of progress of these projects. You
>> may be kind enough to elaborate on KPI for each of these projects and where
>> you think your contributions is going to be.
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 00:44, Holden Karau  wrote:
>>
>>> Hi Folks,
>>>
>>> I'm continuing my adventures to make Spark on containers party and I
>>> was wondering if folks have experience with the different batch
>>> scheduler options that they prefer? I was thinking so that we can
>>> better support dynamic al

Re: Question on spark on Kubernetes

2021-05-20 Thread Gourav Sengupta
Hi Mithalee,
Let's start with why: why are you using Kubernetes and not just EMR on EC2?

Do you have extremely bespoke library dependencies and requirements? Or
do your workloads fail if the clusters do not scale up or down within a
few minutes?


Regards,
Gourav Sengupta

On Thu, May 20, 2021 at 9:50 PM Mithalee Mohapatra <
mithaleemohapa...@gmail.com> wrote:

> Hi,
> I am currently trying to run spark submit in Kubernetes. I have set up the
> IAM roles for serviceaccount and generated the ARN. I am trying to use the
> "spark.hadoop.fs.s3a.fast.upload=true --conf
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
> but getting the below error. Do I need to create a token file. What will be
> the content of the token file and how can I deploy it in the cluster.
> [image: image.png]
>


Question on spark on Kubernetes

2021-05-20 Thread Mithalee Mohapatra
Hi,
I am currently trying to run spark-submit in Kubernetes. I have set up the
IAM roles for service accounts and generated the ARN. I am trying to use
"spark.hadoop.fs.s3a.fast.upload=true --conf
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
but am getting the below error. Do I need to create a token file? What will be
the content of the token file, and how can I deploy it in the cluster?
[image: image.png]
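
For what it's worth, with IAM Roles for Service Accounts (IRSA) on EKS the token file is normally not something you create yourself: annotating the driver/executor service account with the role ARN makes EKS mount a projected web identity token and inject AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into the containers, which is what WebIdentityTokenCredentialsProvider reads. A hedged Scala sketch of the Spark side; the role, bucket and paths are made up, and hadoop-aws plus the AWS SDK are assumed to be on the classpath.

import org.apache.spark.sql.SparkSession

// Assumes the driver and executor pods run under a service account annotated with
//   eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-spark-role   (made-up role)
// so that EKS injects AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE and the projected token.
val spark = SparkSession.builder()
  .appName("s3a-irsa-sketch")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
          "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .getOrCreate()

// Sanity check: the webhook should have injected the token file reference into this container.
println(sys.env.get("AWS_WEB_IDENTITY_TOKEN_FILE"))

// Hypothetical bucket/prefix, only to exercise the credentials path.
spark.read.text("s3a://my-bucket/some/prefix/").show(5)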


Re: [Spark in Kubernetes] Question about running in client mode

2021-04-27 Thread Shiqi Sun
Hi Attila,

Ah that makes sense. Thanks for the clarification!

Best,
Shiqi

On Mon, Apr 26, 2021 at 8:09 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Shiqi,
>
> In case of client mode the driver runs locally: in the same machine, even
> in the same process, of the spark submit.
>
> So if the application was submitted in a running POD then the driver will
> be running in a POD and when outside of K8s then it will be running
> outside.
> This is why there is no config mentioned for this.
>
> From the deploy mode in general you can read here:
> https://spark.apache.org/docs/latest/submitting-applications.html
>
> Best Regards,
> Attila
>
> On Tue, Apr 27, 2021 at 12:03 AM Shiqi Sun  wrote:
>
>> Hi Spark User group,
>>
>> I have a couple of quick questions about running Spark in Kubernetes
>> between different deploy modes.
>>
>> As specified in
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
>> since Spark 2.4, client mode support is available when running in
>> Kubernetes, and it says "when your application runs in client mode, the
>> driver can run inside a pod or on a physical host". Then here come the
>> questions.
>>
>> 1. If I understand correctly, in cluster mode, the driver is also running
>> inside a k8s pod. Then, what's the difference between running it in cluster
>> mode, versus running it in client mode when I choose to run my driver in a
>> pod?
>>
>> 2. What does it mean by "running driver on a physical host"? Does it mean
>> that it runs outside of the k8s cluster? What config should I pass to spark
>> submit so that it runs this way, instead of running my driver into a k8s
>> pod?
>>
>> Thanks!
>>
>> Best,
>> Shiqi
>>
>


Re: [Spark in Kubernetes] Question about running in client mode

2021-04-26 Thread Attila Zsolt Piros
Hi Shiqi,

In case of client mode the driver runs locally: on the same machine, even
in the same process, as the spark-submit.

So if the application was submitted from a running pod then the driver will
be running in that pod, and when submitted from outside of K8s it will be
running outside.
This is why there is no config mentioned for this.

About deploy modes in general you can read here:
https://spark.apache.org/docs/latest/submitting-applications.html

Best Regards,
Attila

On Tue, Apr 27, 2021 at 12:03 AM Shiqi Sun  wrote:

> Hi Spark User group,
>
> I have a couple of quick questions about running Spark in Kubernetes
> between different deploy modes.
>
> As specified in
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
> since Spark 2.4, client mode support is available when running in
> Kubernetes, and it says "when your application runs in client mode, the
> driver can run inside a pod or on a physical host". Then here come the
> questions.
>
> 1. If I understand correctly, in cluster mode, the driver is also running
> inside a k8s pod. Then, what's the difference between running it in cluster
> mode, versus running it in client mode when I choose to run my driver in a
> pod?
>
> 2. What does it mean by "running driver on a physical host"? Does it mean
> that it runs outside of the k8s cluster? What config should I pass to spark
> submit so that it runs this way, instead of running my driver into a k8s
> pod?
>
> Thanks!
>
> Best,
> Shiqi
>


[Spark in Kubernetes] Question about running in client mode

2021-04-26 Thread Shiqi Sun
Hi Spark User group,

I have a couple of quick questions about running Spark in Kubernetes
between different deploy modes.

As specified in
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
since Spark 2.4, client mode support is available when running in
Kubernetes, and it says "when your application runs in client mode, the
driver can run inside a pod or on a physical host". Then here come the
questions.

1. If I understand correctly, in cluster mode, the driver is also running
inside a k8s pod. Then, what's the difference between running it in cluster
mode, versus running it in client mode when I choose to run my driver in a
pod?

2. What does "running the driver on a physical host" mean? Does it mean
that it runs outside of the k8s cluster? What config should I pass to
spark-submit so that it runs this way, instead of running my driver in a k8s
pod?

Thanks!

Best,
Shiqi
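
To make the client-mode case concrete (the driver is simply the JVM you start, wherever it runs), a hedged Scala sketch; the API server URL, namespace and image are assumptions, and in client mode the executors must also be able to reach the driver (spark.driver.host, typically via a headless service when the driver itself runs in a pod).

import org.apache.spark.sql.SparkSession

// Client mode on K8s: this process is the driver; only executors are created as pods.
// Run it inside a pod and the driver lives in that pod; run it on a physical host or
// laptop outside the cluster and the driver lives there.
val spark = SparkSession.builder()
  .appName("client-mode-sketch")
  .master("k8s://https://kubernetes.default.svc:443")                 // assumed API server address
  .config("spark.submit.deployMode", "client")
  .config("spark.kubernetes.namespace", "spark")                      // assumed namespace
  .config("spark.kubernetes.container.image", "my-repo/spark:3.1.1")  // assumed executor image
  .config("spark.executor.instances", "2")
  .getOrCreate()

println(spark.range(1000).selectExpr("sum(id)").collect().mkString)
spark.stop()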


Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-10 Thread ranju goel
Hi Attila,

Thanks for your guidance on how to use dynamic allocation effectively for a
Spark job. Now I am a bit more confident about setting the
schedulerBacklogTimeout wisely.

Regarding your statement: "If there is no more new resource available for
Spark then the existing ones will be used (even the min executors count is
not guaranteed to be reached if no resources are available)."

*We discussed this earlier on the Spark JIRA, where I wanted to kill the job
if no resources were available. I finally succeeded in killing the job with
the help of the Spark listener event bus, where I use spark.extraListeners
and get a notification when executors get added during SparkContext
initialization.*
*If I am not notified within roughly 10 seconds at the time of SparkContext
initialization, then I assume the resources are unavailable and
subsequently kill the job.*

*I saw a similar property in the Spark documentation:*

spark.scheduler.maxRegisteredResourcesWaitingTime  30s  *Maximum amount of
time to wait for resources to register before scheduling begins.*

*Does this property mean that the scheduler will wait only 30 seconds, and if
resources are still not registered, the scheduler will give up on this job
while the executors remain in a Pending state? Or does this property do more?*

*Best Regards*
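
For reference, a minimal Scala sketch of the kind of listener described above; the package and class names are made up, it would need to be compiled into a jar on the driver classpath, and it only illustrates the onExecutorAdded hook rather than the actual code used. As far as I know, spark.scheduler.maxRegisteredResourcesWaitingTime only bounds how long the scheduler waits before it starts scheduling on whatever resources have registered; it does not cancel the job by itself.

package com.example.monitoring  // hypothetical package

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

// Wire it in with: --conf spark.extraListeners=com.example.monitoring.ExecutorAddedListener
class ExecutorAddedListener extends SparkListener {
  private val added = new AtomicInteger(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    val total = added.incrementAndGet()
    // Seeing this event shortly after SparkContext initialization is the signal that
    // executor resources did register; its absence can be used to decide to give up.
    println(s"Executor ${event.executorId} added on ${event.executorInfo.executorHost}; total so far = $total")
  }
}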



On Sat, Apr 10, 2021 at 11:50 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Ranju!
>
> > But if there are no extra resources available, then go for static
> allocation rather dynamic. Is it correct ?
>
> I think there is no such rule. If there is no more available new resource
> for Spark then the existing ones will be used (even the min executors is
> not guaranteed to be reached if no available resources).
>
> But I suggest to always set the max executors to a meaningful value (the
> default is too high: int max).
> This way you can avoid too high costs for a small/medium sized job where
> the tasks number is high but their size are small.
>
> Regarding your questions: in both cases as I see extra resources are
> helping and the jobs will be finished faster.
>
> Best Regards,
> Attila
>
>
> On Sat, Apr 10, 2021 at 7:01 PM ranju goel  wrote:
>
>> Hi Attila,
>>
>>
>> I understood what you mean that Use the extra resources if available for
>> running spark job, using schedulerbacklogtimeout (dynamic allocation).
>>
>> This will speeds up the job. But if there are no extra resources
>> available, then go for static allocation rather dynamic. Is it correct ?
>>
>>
>> Please validate below few scenarios for effective use of dynamic
>> allocation
>>
>>
>> 1.  Below screenshot shows, the Tasks are tiny, each task is executing
>> fast, but number of total tasks count is high (3241).
>>
>> *Dynamic Allocation Advantage for this scenario*
>>
>> If reserved spark quota has more resources available when min Executors
>> running, setting schedulerbacklogtimeout to few secs [say 15 min], those
>> available quota resources can be used and  (3241) number of tasks can be
>> finished fast. Is this understanding correct?
>>
>> [image: image.png]
>>
>>
>>
>> 2. Below report has less total number of tasks count (192) and parallel
>> running task count (24), but each task took around 7 min to complete.
>>
>> So here again, if resources are available in quota, more parallelism can
>> be achieved using schedulerbacklogtimeout (say 15 mins) and speeds up the
>> job.
>>
>>
>> [image: image.png]
>>
>> Best Regards
>>
>>
>>
>>
>>
>> *From:* Attila Zsolt Piros 
>> *Sent:* Friday, April 9, 2021 11:11 AM
>> *To:* Ranju Jain 
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>>
>>
>>
>> You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
>> high and the purpose of this config is very different form the one you
>> would like to use it for.
>>
>>
>> The confusion I guess comes from the fact that you are still thinking in
>> multiple Spark jobs.
>>
>>
>> *But Dynamic Allocation is useful in case of a single Spark job, too. *With
>> Dynamic allocation if there are pending tasks then new resources should be
>> allocated to speed up the calculation.
>> If you do not have enough partitions then you do not have enough tasks to
>> run in parallel that was my earlier comment about.
>>
>> So let's focus on your first job:
>> - With 3 executors it takes 2 hours to complete, right?
>> - And what about 8 executors?  I hope significantly less time.
>>
>> So if you have more than 3 

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-10 Thread Attila Zsolt Piros
Hi Ranju!

> But if there are no extra resources available, then go for static
allocation rather dynamic. Is it correct ?

I think there is no such rule. If there is no more new resource available
for Spark then the existing ones will be used (even the min executors count
is not guaranteed to be reached if no resources are available).

But I suggest always setting the max executors to a meaningful value (the
default is too high: int max).
This way you can avoid too high a cost for a small/medium sized job where
the task count is high but the tasks themselves are small.

Regarding your questions: in both cases, as I see it, the extra resources
help and the jobs will finish faster.

Best Regards,
Attila


On Sat, Apr 10, 2021 at 7:01 PM ranju goel  wrote:

> Hi Attila,
>
>
> I understood what you mean that Use the extra resources if available for
> running spark job, using schedulerbacklogtimeout (dynamic allocation).
>
> This will speeds up the job. But if there are no extra resources
> available, then go for static allocation rather dynamic. Is it correct ?
>
>
> Please validate below few scenarios for effective use of dynamic allocation
>
>
> 1.  Below screenshot shows, the Tasks are tiny, each task is executing
> fast, but number of total tasks count is high (3241).
>
> *Dynamic Allocation Advantage for this scenario*
>
> If reserved spark quota has more resources available when min Executors
> running, setting schedulerbacklogtimeout to few secs [say 15 min], those
> available quota resources can be used and  (3241) number of tasks can be
> finished fast. Is this understanding correct?
>
> [image: image.png]
>
>
>
> 2. Below report has less total number of tasks count (192) and parallel
> running task count (24), but each task took around 7 min to complete.
>
> So here again, if resources are available in quota, more parallelism can
> be achieved using schedulerbacklogtimeout (say 15 mins) and speeds up the
> job.
>
>
> [image: image.png]
>
> Best Regards
>
>
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Friday, April 9, 2021 11:11 AM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
> high and the purpose of this config is very different form the one you
> would like to use it for.
>
>
> The confusion I guess comes from the fact that you are still thinking in
> multiple Spark jobs.
>
>
> *But Dynamic Allocation is useful in case of a single Spark job, too. *With
> Dynamic allocation if there are pending tasks then new resources should be
> allocated to speed up the calculation.
> If you do not have enough partitions then you do not have enough tasks to
> run in parallel that was my earlier comment about.
>
> So let's focus on your first job:
> - With 3 executors it takes 2 hours to complete, right?
> - And what about 8 executors?  I hope significantly less time.
>
> So if you have more than 3 partitions and the tasks are meaningfully long
> enough to request some extra resources (schedulerBacklogTimeout) and the
> number of running executors are lower than the maximum number of executors
> you set (maxExecutors) then why wouldn't you want to use those extra
> resources?
>
>
>
>
>
>
> On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain  wrote:
>
> Hi Attila,
>
>
>
> Thanks for your reply.
>
>
>
> If I talk about single job which started to run with minExecutors as *3*.
> And Suppose this job [*which reads the full data from backend and process
> and writes it to a location*]
>
> takes around 2 hour to complete.
>
>
>
> What I understood is, as the default value of
> spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, so executors will
> scale from *3* to *4* and then *8* after every second if tasks are
> pending at scheduler backend. So If I don’t want  it 1 sec and I might set
> it to 1 hour [3600 sec] in 2 hour of spark job.
>
>
>
> So this is all about when I want to scale executors dynamically for spark
> job. Is that understanding correct?
>
>
>
> In the below statement I don’t understand much about available partitions
> :-(
>
> *pending tasks (which kinda related to the available partitions)*
>
>
>
>
>
> Regards
>
> Ranju
>
>
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Friday, April 9, 2021 12:13 AM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> Hi!
>
> For dynamic allocation you do not need to run the Spark jobs in parallel.
> Dyna

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-10 Thread ranju goel
Hi Attila,


I understood what you mean: use the extra resources, if available, for
running the Spark job, using schedulerBacklogTimeout (dynamic allocation).

This will speed up the job. But if there are no extra resources available,
then go for static allocation rather than dynamic. Is that correct?


Please validate the few scenarios below for effective use of dynamic allocation


1.  The screenshot below shows that the tasks are tiny and each task executes
fast, but the total task count is high (3241).

*Dynamic Allocation Advantage for this scenario*

If the reserved Spark quota has more resources available while the min
executors are running, then by setting schedulerBacklogTimeout to a suitable
value [say 15 min], those available quota resources can be used and the 3241
tasks can be finished fast. Is this understanding correct?

[image: image.png]



2. The report below has a lower total task count (192) and a parallel
running task count of 24, but each task took around 7 min to complete.

So here again, if resources are available in the quota, more parallelism can
be achieved using schedulerBacklogTimeout (say 15 mins), which speeds up the job.


[image: image.png]

Best Regards





*From:* Attila Zsolt Piros 
*Sent:* Friday, April 9, 2021 11:11 AM
*To:* Ranju Jain 
*Cc:* user@spark.apache.org
*Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes



You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
high and the purpose of this config is very different form the one you
would like to use it for.


The confusion I guess comes from the fact that you are still thinking in
multiple Spark jobs.


*But Dynamic Allocation is useful in case of a single Spark job, too. *With
Dynamic allocation if there are pending tasks then new resources should be
allocated to speed up the calculation.
If you do not have enough partitions then you do not have enough tasks to
run in parallel that was my earlier comment about.

So let's focus on your first job:
- With 3 executors it takes 2 hours to complete, right?
- And what about 8 executors?  I hope significantly less time.

So if you have more than 3 partitions and the tasks are meaningfully long
enough to request some extra resources (schedulerBacklogTimeout) and the
number of running executors are lower than the maximum number of executors
you set (maxExecutors) then why wouldn't you want to use those extra
resources?






On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain  wrote:

Hi Attila,



Thanks for your reply.



If I talk about single job which started to run with minExecutors as *3*.
And Suppose this job [*which reads the full data from backend and process
and writes it to a location*]

takes around 2 hour to complete.



What I understood is, as the default value of
spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, so executors will
scale from *3* to *4* and then *8* after every second if tasks are pending
at scheduler backend. So If I don’t want  it 1 sec and I might set it to 1
hour [3600 sec] in 2 hour of spark job.



So this is all about when I want to scale executors dynamically for spark
job. Is that understanding correct?



In the below statement I don’t understand much about available partitions
:-(

*pending tasks (which kinda related to the available partitions)*





Regards

Ranju





*From:* Attila Zsolt Piros 
*Sent:* Friday, April 9, 2021 12:13 AM
*To:* Ranju Jain 
*Cc:* user@spark.apache.org
*Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes



Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more
executors when there are pending tasks (which kinda related to the
available partitions) and scales down when the executor is idle (as within
one job the number of partitions can fluctuate).

But if you optimize for run time you can start those jobs in parallel at
the beginning.

In this case you will use higher number of executors even from the
beginning.

The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
schedule/synchronize different Spark jobs but it is about tasks.

Best regards,
Attila



On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
wrote:

Hi All,



I have set dynamic allocation enabled while running spark on Kubernetes .
But new executors are requested if pending tasks are backlogged for more
than configured duration in property
*“spark.dynamicAllocation.schedulerBacklogTimeout”*.



My Use Case is:



There are number of parallel jobs which might or might not run together at
a particular point of time. E.g Only One Spark Job may run at a point of
time or two spark jobs may run at a single point of time depending upon the
need.

I configured spark.dynamicAllocation.minExecutors as 3 and
spark.dynamicAllocation.maxExecutors as 8 .



Steps:

   1. SparkContext initialized with 3 executors and First Job requested.
   2. Now, if second job requested after few mins  (e.g 15 mins) , I am

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Attila Zsolt Piros
You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
high and the purpose of this config is very different form the one you
would like to use it for.

The confusion I guess comes from the fact that you are still thinking in
multiple Spark jobs.


*But Dynamic Allocation is useful in case of a single Spark job, too.*With
Dynamic allocation if there are pending tasks then new resources should be
allocated to speed up the calculation.
If you do not have enough partitions then you do not have enough tasks to
run in parallel that was my earlier comment about.

So let's focus on your first job:
- With 3 executors it takes 2 hours to complete, right?
- And what about 8 executors?  I hope significantly less time.

So if you have more than 3 partitions and the tasks are meaningfully long
enough to request some extra resources (schedulerBacklogTimeout) and the
number of running executors are lower than the maximum number of executors
you set (maxExecutors) then why wouldn't you want to use those extra
resources?
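
To make the knobs discussed in this thread concrete, a hedged Scala sketch of a dynamic-allocation setup for Spark on Kubernetes; the values are illustrative only, and shuffle tracking is assumed as the usual companion setting where no external shuffle service is available.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  // no external shuffle service on K8s
  .config("spark.dynamicAllocation.minExecutors", "3")
  .config("spark.dynamicAllocation.maxExecutors", "8")                // a deliberate, meaningful upper bound
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Keep this small: it is "how long tasks may sit pending before more executors
  // are requested", not a mechanism for synchronizing separate jobs.
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .getOrCreate()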



On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain  wrote:

> Hi Attila,
>
>
>
> Thanks for your reply.
>
>
>
> If I talk about single job which started to run with minExecutors as *3*.
> And Suppose this job [*which reads the full data from backend and process
> and writes it to a location*]
>
> takes around 2 hour to complete.
>
>
>
> What I understood is, as the default value of
> spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, so executors will
> scale from *3* to *4* and then *8* after every second if tasks are
> pending at scheduler backend. So If I don’t want  it 1 sec and I might set
> it to 1 hour [3600 sec] in 2 hour of spark job.
>
>
>
> So this is all about when I want to scale executors dynamically for spark
> job. Is that understanding correct?
>
>
>
> In the below statement I don’t understand much about available partitions
> :-(
>
> *pending tasks (which kinda related to the available partitions)*
>
>
>
>
>
> Regards
>
> Ranju
>
>
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Friday, April 9, 2021 12:13 AM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> Hi!
>
> For dynamic allocation you do not need to run the Spark jobs in parallel.
> Dynamic allocation simply means Spark scales up by requesting more
> executors when there are pending tasks (which kinda related to the
> available partitions) and scales down when the executor is idle (as within
> one job the number of partitions can fluctuate).
>
> But if you optimize for run time you can start those jobs in parallel at
> the beginning.
>
> In this case you will use higher number of executors even from the
> beginning.
>
> The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
> schedule/synchronize different Spark jobs but it is about tasks.
>
> Best regards,
> Attila
>
>
>
> On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
> wrote:
>
> Hi All,
>
>
>
> I have set dynamic allocation enabled while running spark on Kubernetes .
> But new executors are requested if pending tasks are backlogged for more
> than configured duration in property
> *“spark.dynamicAllocation.schedulerBacklogTimeout”*.
>
>
>
> My Use Case is:
>
>
>
> There are number of parallel jobs which might or might not run together at
> a particular point of time. E.g Only One Spark Job may run at a point of
> time or two spark jobs may run at a single point of time depending upon the
> need.
>
> I configured spark.dynamicAllocation.minExecutors as 3 and
> spark.dynamicAllocation.maxExecutors as 8 .
>
>
>
> Steps:
>
>1. SparkContext initialized with 3 executors and First Job requested.
>2. Now, if second job requested after few mins  (e.g 15 mins) , I am
>thinking if I can use the benefit of dynamic allocation and executor should
>scale up to handle second job tasks.
>
> For this I think *“spark.dynamicAllocation.schedulerBacklogTimeout”*
> needs to set after which new executors would be requested.
>
> *Problem: *Problem is there are chances that second job is not requested
> at all or may be requested after 10 mins or after 20 mins. How can I set a
> constant value for
>
> property *“spark.dynamicAllocation.schedulerBacklogTimeout” *to scale the
> executors , when tasks backlog is dependent upon the number of jobs
> requested.
>
>
>
> Regards
>
> Ranju
>
>


RE: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Ranju Jain
Hi Attila,

Thanks for your reply.

Say I have a single job which starts to run with minExecutors as 3, and suppose 
this job [which reads the full data from the backend, processes it and writes 
it to a location]
takes around 2 hours to complete.

What I understood is: as the default value of 
spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, executors will 
scale from 3 to 4 and then 8 every second if tasks are pending at the 
scheduler backend. So if I don't want it to be 1 sec, I might set it to 1 hour 
[3600 sec] for a 2-hour Spark job.

So this is all about when I want to scale executors dynamically for a Spark job. 
Is that understanding correct?

In the statement below I don't understand much about "available partitions" :-(
pending tasks (which kinda related to the available partitions)


Regards
Ranju


From: Attila Zsolt Piros 
Sent: Friday, April 9, 2021 12:13 AM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more executors 
when there are pending tasks (which kinda related to the available partitions) 
and scales down when the executor is idle (as within one job the number of 
partitions can fluctuate).

But if you optimize for run time you can start those jobs in parallel at the 
beginning.
In this case you will use higher number of executors even from the beginning.

The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to 
schedule/synchronize different Spark jobs but it is about tasks.

Best regards,
Attila

On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi All,

I have set dynamic allocation enabled while running spark on Kubernetes . But 
new executors are requested if pending tasks are backlogged for more than 
configured duration in property 
“spark.dynamicAllocation.schedulerBacklogTimeout”.

My Use Case is:

There are number of parallel jobs which might or might not run together at a 
particular point of time. E.g Only One Spark Job may run at a point of time or 
two spark jobs may run at a single point of time depending upon the need.
I configured spark.dynamicAllocation.minExecutors as 3 and 
spark.dynamicAllocation.maxExecutors as 8 .

Steps:

  1.  SparkContext initialized with 3 executors and First Job requested.
  2.  Now, if second job requested after few mins  (e.g 15 mins) , I am 
thinking if I can use the benefit of dynamic allocation and executor should 
scale up to handle second job tasks.

For this I think “spark.dynamicAllocation.schedulerBacklogTimeout” needs to set 
after which new executors would be requested.

Problem: Problem is there are chances that second job is not requested at all 
or may be requested after 10 mins or after 20 mins. How can I set a constant 
value for

property “spark.dynamicAllocation.schedulerBacklogTimeout” to scale the 
executors , when tasks backlog is dependent upon the number of jobs requested.


Regards
Ranju


Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Attila Zsolt Piros
Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more executors
when there are pending tasks (which is roughly related to the number of
available partitions) and scales down when executors are idle (since within
one job the number of partitions can fluctuate).

But if you optimize for run time, you can start those jobs in parallel at the
beginning.
In that case you will use a higher number of executors right from the start.

The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
schedule/synchronize different Spark jobs but it is about tasks.
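
For reference, a minimal sketch of the settings discussed in this thread,
assuming Spark 3.0+ on Kubernetes; the master URL, container image, class and
jar below are placeholders, and on K8s the shuffle-tracking flag stands in for
the external shuffle service:

spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=3 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.kubernetes.container.image=my-spark:3.1.1 \
  --class com.example.MyJob \
  local:///opt/app/my-job.jar

The backlog timeout is counted from the moment tasks start queuing, and idle
executors are released again after executorIdleTimeout, so the min/max bounds
plus the defaults usually do the right thing without per-job tuning.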

Best regards,
Attila

On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
wrote:

> Hi All,
>
>
>
> I have set dynamic allocation enabled while running spark on Kubernetes .
> But new executors are requested if pending tasks are backlogged for more
> than configured duration in property
> *“spark.dynamicAllocation.schedulerBacklogTimeout”*.
>
>
>
> My Use Case is:
>
>
>
> There are number of parallel jobs which might or might not run together at
> a particular point of time. E.g Only One Spark Job may run at a point of
> time or two spark jobs may run at a single point of time depending upon the
> need.
>
> I configured spark.dynamicAllocation.minExecutors as 3 and
> spark.dynamicAllocation.maxExecutors as 8 .
>
>
>
> Steps:
>
>1. SparkContext initialized with 3 executors and First Job requested.
>2. Now, if second job requested after few mins  (e.g 15 mins) , I am
>thinking if I can use the benefit of dynamic allocation and executor should
>scale up to handle second job tasks.
>
> For this I think *“spark.dynamicAllocation.schedulerBacklogTimeout”*
> needs to set after which new executors would be requested.
>
> *Problem: *Problem is there are chances that second job is not requested
> at all or may be requested after 10 mins or after 20 mins. How can I set a
> constant value for
>
> property *“spark.dynamicAllocation.schedulerBacklogTimeout” *to scale the
> executors , when tasks backlog is dependent upon the number of jobs
> requested.
>
>
>
> Regards
>
> Ranju
>


Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-06 Thread Ranju Jain
Hi All,

I have enabled dynamic allocation while running Spark on Kubernetes. New
executors are requested only if pending tasks have been backlogged for more
than the duration configured in the property
"spark.dynamicAllocation.schedulerBacklogTimeout".

My Use Case is:

There are a number of parallel jobs which may or may not run together at a
particular point in time. E.g. only one Spark job may run at a given time, or
two Spark jobs may run at the same time, depending on the need.
I configured spark.dynamicAllocation.minExecutors as 3 and 
spark.dynamicAllocation.maxExecutors as 8 .

Steps:

  1.  SparkContext initialized with 3 executors and First Job requested.
  2.  Now, if a second job is requested after a few minutes (e.g. 15 mins), I am
thinking I can use the benefit of dynamic allocation, and executors should
scale up to handle the second job's tasks.

For this I think "spark.dynamicAllocation.schedulerBacklogTimeout" needs to be
set, after which new executors would be requested.

Problem: There is a chance that the second job is not requested at all, or is
requested only after 10 or 20 minutes. How can I set a constant value for the
property "spark.dynamicAllocation.schedulerBacklogTimeout" to scale the
executors, when the task backlog depends on the number of jobs requested?


Regards
Ranju
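
For what it's worth, a small sketch of how the two backlog-related properties
read in the docs, in spark-defaults.conf style (the values shown are just the
defaults). The timer starts when tasks begin to queue, not when a job is
submitted, so it does not need to be tuned to whenever a second job happens to
arrive:

# Request more executors once tasks have been pending for this long
# (the clock starts when a backlog appears, whichever job caused it).
spark.dynamicAllocation.schedulerBacklogTimeout    1s

# While the backlog persists, keep requesting additional executors at this
# interval, up to spark.dynamicAllocation.maxExecutors.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout    1s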


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Ok!

Thanks for all guidance :-)

Regards
Ranju

From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 11:07 PM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

I don't have any specific reference. However, you can do a Google search.

best to ask the Unix team. They can do all that themselves.

HTH





LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:53, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Yes, there is a Team but I have not contacted them yet.
Trying to understand at my end.

I understood your point you mentioned below:

Do you have any reference or links where I can check out the Shared Volumes ?

Regards
Ranju

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:38 PM
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Well your mileage varies so to speak.


The only way to find out is setting an NFS mount and testing it.



The performance will depend on the mounted file system and the amount of cache 
it has.



File cache is important for reads and if you are going to do random writes (as 
opposed to sequential writes), then you can stripe the volume (RAID 1) for 
better performance.



Do you have a UNIX admin who can help you out as well?



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear for pros and cons of Shared Volume vs NFS for Read Write Many.
As NFS is Network File Server [remote] , so I can figure out that Shared Volume 
should be more preferable, but don’t know the other sides [drawback].

Regards
Ranju
From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all Executors pods data on some common location  which can be 
accessed and retrieved by driver pod.
I was first planning to go with NFS, but I think Shared Volume is equally good.
Please suggest Is there any major drawback in using Shared Volume instead of 
NFS when many pods are writing  on the same Volume [ReadWriteMany].

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
I don't have any specific reference. However, you can do a Google search.

best to ask the Unix team. They can do all that themselves.

HTH



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 12:53, Ranju Jain  wrote:

> Yes, there is a Team but I have not contacted them yet.
>
> Trying to understand at my end.
>
>
>
> I understood your point you mentioned below:
>
>
>
> Do you have any reference or links where I can check out the Shared
> Volumes ?
>
>
>
> Regards
>
> Ranju
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:38 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Well your mileage varies so to speak.
>
>
>
> The only way to find out is setting an NFS mount and testing it.
>
>
>
> The performance will depend on the mounted file system and the amount of
> cache it has.
>
>
>
> File cache is important for reads and if you are going to do random writes
> (as opposed to sequential writes), then you can stripe the volume (RAID 1)
> for better performance.
>
>
>
> Do you have a UNIX admin who can help you out as well?
>
>
>
> HTH
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 12:01, Ranju Jain  wrote:
>
> Hi Mich,
>
>
>
> No, it is not Google cloud. It is simply Kubernetes deployed over Bare
> Metal Platform.
>
> I am not clear for pros and cons of Shared Volume vs NFS for Read Write
> Many.
>
> As NFS is Network File Server [remote] , so I can figure out that Shared
> Volume should be more preferable, but don’t know the other sides [drawback].
>
>
>
> Regards
>
> Ranju
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:22 PM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Ok this is on Google Cloud correct?
>
>
>
>
>
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
> wrote:
>
> Hi,
>
>
>
> I need to write all Executors pods data on some common location  which can
> be accessed and retrieved by driver pod.
>
> I was first planning to go with NFS, but I think Shared Volume is equally
> good.
>
> Please suggest Is there any major drawback in using Shared Volume instead
> of NFS when many pods are writing  on the same Volume [ReadWriteMany].
>
>
>
> Regards
>
> Ranju
>
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Yes, there is a team, but I have not contacted them yet.
I am trying to understand it at my end first.

I understood the point you mentioned below.

Do you have any references or links where I can read up on shared volumes?

Regards
Ranju

From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 5:38 PM
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Well your mileage varies so to speak.


The only way to find out is setting an NFS mount and testing it.



The performance will depend on the mounted file system and the amount of cache 
it has.



File cache is important for reads and if you are going to do random writes (as 
opposed to sequential writes), then you can stripe the volume (RAID 1) for 
better performance.



Do you have a UNIX admin who can help you out as well?



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear for pros and cons of Shared Volume vs NFS for Read Write Many.
As NFS is Network File Server [remote] , so I can figure out that Shared Volume 
should be more preferable, but don’t know the other sides [drawback].

Regards
Ranju
From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all Executors pods data on some common location  which can be 
accessed and retrieved by driver pod.
I was first planning to go with NFS, but I think Shared Volume is equally good.
Please suggest Is there any major drawback in using Shared Volume instead of 
NFS when many pods are writing  on the same Volume [ReadWriteMany].

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
Well your mileage varies so to speak.

The only way to find out is setting an NFS mount and testing it.


The performance will depend on the mounted file system and the amount of
cache it has.


File cache is important for reads and if you are going to do random writes
(as opposed to sequential writes), then you can stripe the volume (RAID 1)
for better performance.
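
For what it's worth, a rough sketch of such a test from one worker node; the
server name and export path below are invented, and the figures will depend
heavily on the network and on the NFS server's cache:

# Mount the export (server name and path are placeholders)
sudo mkdir -p /mnt/spark-shared
sudo mount -t nfs nfs-server.example.com:/export/spark /mnt/spark-shared

# Crude sequential write and read test (1 GiB)
dd if=/dev/zero of=/mnt/spark-shared/testfile bs=1M count=1024 conv=fdatasync
dd if=/mnt/spark-shared/testfile of=/dev/null bs=1M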


Do you have a UNIX admin who can help you out as well?


HTH


LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain  wrote:

> Hi Mich,
>
>
>
> No, it is not Google cloud. It is simply Kubernetes deployed over Bare
> Metal Platform.
>
> I am not clear for pros and cons of Shared Volume vs NFS for Read Write
> Many.
>
> As NFS is Network File Server [remote] , so I can figure out that Shared
> Volume should be more preferable, but don’t know the other sides [drawback].
>
>
>
> Regards
>
> Ranju
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:22 PM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Ok this is on Google Cloud correct?
>
>
>
>
>
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
> wrote:
>
> Hi,
>
>
>
> I need to write all Executors pods data on some common location  which can
> be accessed and retrieved by driver pod.
>
> I was first planning to go with NFS, but I think Shared Volume is equally
> good.
>
> Please suggest Is there any major drawback in using Shared Volume instead
> of NFS when many pods are writing  on the same Volume [ReadWriteMany].
>
>
>
> Regards
>
> Ranju
>
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Hi Mich,

No, it is not Google Cloud. It is simply Kubernetes deployed on a bare-metal
platform.
I am not clear on the pros and cons of a shared volume vs NFS for ReadWriteMany.
As NFS is a network file system [remote], I can see that a shared volume should
be preferable, but I don't know the other side [drawbacks].

Regards
Ranju
From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all Executors pods data on some common location  which can be 
accessed and retrieved by driver pod.
I was first planning to go with NFS, but I think Shared Volume is equally good.
Please suggest Is there any major drawback in using Shared Volume instead of 
NFS when many pods are writing  on the same Volume [ReadWriteMany].

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
Ok this is on Google Cloud correct?




LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
wrote:

> Hi,
>
>
>
> I need to write all Executors pods data on some common location  which can
> be accessed and retrieved by driver pod.
>
> I was first planning to go with NFS, but I think Shared Volume is equally
> good.
>
> Please suggest Is there any major drawback in using Shared Volume instead
> of NFS when many pods are writing  on the same Volume [ReadWriteMany].
>
>
>
> Regards
>
> Ranju
>


Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Hi,

I need to write all executor pods' data to some common location which can be
accessed and retrieved by the driver pod.
I was first planning to go with NFS, but I think a shared volume is equally good.
Please suggest: is there any major drawback in using a shared volume instead of
NFS when many pods are writing to the same volume [ReadWriteMany]?

Regards
Ranju
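
One way to do this with Spark on K8s is to mount a single ReadWriteMany
PersistentVolumeClaim into the driver and all executors through the
spark.kubernetes.*.volumes options. A hedged sketch: the volume name, claim
name and mount path below are invented, and the PVC has to come from a storage
class that actually supports RWX (NFS-backed provisioners are a common way to
get that):

--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.shared.options.claimName=spark-shared-pvc
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.shared.mount.path=/shared
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.options.claimName=spark-shared-pvc
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.mount.path=/shared

Executors can then write under /shared and the driver can read the results
back from the same path.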


Re: vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Sean Owen
You probably don't want swapping in any environment. Some tasks will grind
to a halt under mem pressure rather than just fail quickly. You would want
to simply provision more memory.
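
If you do decide to lower it on the worker nodes anyway, note that it is a
node-level sysctl rather than anything Spark or the pod spec controls; a
minimal sketch (the drop-in file name is arbitrary):

# Check the current value on a worker node
sysctl vm.swappiness

# Lower it for the running kernel
sudo sysctl -w vm.swappiness=1

# Persist it across reboots
echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/90-swappiness.conf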

On Tue, Feb 16, 2021, 7:57 AM Jahar Tyagi  wrote:

> Hi,
>
> We have recently migrated from Spark 2.4.4 to Spark 3.0.1 and using Spark
> in virtual machine/bare metal as standalone deployment and as kubernetes
> deployment as well.
>
> There is a kernel parameter named as 'vm.swappiness' and we keep its value
> as '1' in standard deployment. Now since we are moving to kubernetes and on
> kubernetes worker nodes the value of this parameter is '60'.
>
> Now my question is if it is OK to keep such a high value of
> 'vm.swappiness'=60 in kubernetes environment for Spark workloads.
>
> Will such high value of this kernel parameter have performance impact on
> Spark PODs?
> As per below link from cloudera, they suggest not to set such a high
> value.
>
>
> https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-setting-vmswappiness-linux-kernel-parameter.html
>
> Any thoughts/suggestions on this are highly appreciated.
>
> Regards
> Jahar Tyagi
>
>


vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Jahar Tyagi
Hi,

We have recently migrated from Spark 2.4.4 to Spark 3.0.1 and using Spark
in virtual machine/bare metal as standalone deployment and as kubernetes
deployment as well.

There is a kernel parameter named 'vm.swappiness', and we keep its value
at '1' in our standard deployment. Now that we are moving to Kubernetes, we see
that on the Kubernetes worker nodes the value of this parameter is '60'.

My question is whether it is OK to keep such a high value of
'vm.swappiness'=60 in a Kubernetes environment for Spark workloads.

Will such a high value of this kernel parameter have a performance impact on
Spark pods?
As per the link below from Cloudera, they suggest not setting such a high value.

https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-setting-vmswappiness-linux-kernel-parameter.html

Any thoughts/suggestions on this are highly appreciated.

Regards
Jahar Tyagi


[Spark on Kubernetes] Spark Application dependency management Question.

2021-02-03 Thread xgong
Hey Team:

 Currently, we are upgrading the Spark version from 2.4 to 3.0, but we
found that applications which work with Spark 2.4 keep failing with
Spark 3.0. We are running Spark on Kubernetes in cluster mode.

  In spark-submit, we have "--jars local:///apps-dep/spark-extra-jars/*". It
works fine when we are using the Spark 2.4.5 image, but when we try to submit
the same application using the Spark 3.0 image, the driver always fails.
First, it complains "WARN DependencyUtils: Local jar
/apps-dep/spark-extra-jars/* does not exist, skipping.". Then the driver fails
with the exception "Exception in thread "main"
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID
6, 10.98.179.228, executor 1): java.nio.file.NoSuchFileException:
/apps-dep/spark-extra-jars/*". But I can confirm that all of our dependency
jars do exist in /apps-dep/spark-extra-jars in the docker image for both the
driver and the workers. And with Spark 2.4.5 it works fine.
  
   Could you give me a hint on how to debug it and what is going on here?
   
   Also, I do not understand the following behaviors:
   * If I change the --jars parameter value from "local:///" to "file:///",
it works with Spark 3.0.
   * If I use "--jars local:///apps-dep/spark-extra-jars/app.jar", the
submission fails with the exception "Exception in thread "main"
org.apache.spark.SparkException: Please specify
spark.kubernetes.file.upload.path property.", which makes sense according to
the Spark 3.0 doc. But if I use "--jars ///apps-dep/spark-extra-jars/*", the
submission and the application run successfully. Could you help me to
understand why it is fine to use "*" instead of a specific jar file?


Thank you very much

Xuan Gong
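
As a point of comparison, a hedged sketch of how the two URI schemes are
described for Spark 3 in cluster mode on K8s: local:// paths are expected to
already exist inside the container image, while file:// paths are uploaded
from the submitting machine and therefore need
spark.kubernetes.file.upload.path pointing at a Hadoop-compatible location
(the bucket below is made up). It does not explain the "*" behaviour described
above, so treat it only as a starting point:

# Jars already baked into the driver/executor image
spark-submit ... --jars local:///apps-dep/spark-extra-jars/app.jar ...

# Jars that live on the submitting machine
spark-submit ... \
  --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
  --jars file:///apps-dep/spark-extra-jars/app.jar ...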
  




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Everything is working fine now 
Thanks again

Loïc

From: German Schiavon 
Sent: Wednesday, December 16, 2020 19:23
To: Loic DESCOTTE 
Cc: user@spark.apache.org 
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

We all been there! no reason to be ashamed :)

On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Oh thank you you're right!! I feel shameful 


From: German Schiavon mailto:gschiavonsp...@gmail.com>>
Sent: Wednesday, December 16, 2020 18:01
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

  
data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:27
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:20
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
We all been there! no reason to be ashamed :)

On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE <
loic.desco...@kaizen-solutions.net> wrote:

> Oh thank you you're right!! I feel shameful 
>
> --
> *De :* German Schiavon 
> *Envoyé :* mercredi 16 décembre 2020 18:01
> *À :* Loic DESCOTTE 
> *Cc :* user@spark.apache.org 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Hi,
>
> seems that you have a typo no?
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
>
>   data.write.mode("overwrite").format("text").save("hfds://
> hdfs-namenode/user/loic/result.txt")
>
>
> On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> So I've tried several other things, including building a fat jar with hdfs
> dependency inside my app jar, and added this to the Spark configuration in
> the code :
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark 7")
>   .config("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.
> DistributedFileSystem].getName)
>   .getOrCreate()
>
>
> But still the same error...
>
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:27
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> I think it'll have to be part of the Spark distro, but I'm not 100% sure.
> I also think these get registered via manifest files in the JARs; if some
> process is stripping those when creating a bundled up JAR, could be it.
> Could be that it's failing to initialize too for some reason.
>
> On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> I've tried with this spark-submit option :
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But it didn't solve the issue.
> Should I add more jars?
>
> Thanks
> Loïc
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:20
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Seems like your Spark cluster doesn't somehow have the Hadoop JARs?
>
> On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> Hello,
>
> I am using Spark On Kubernetes and I have the following error when I try
> to write data on HDFS : "no filesystem for scheme hdfs"
>
> More details :
>
> I am submitting my application with Spark submit like this :
>
> spark-submit --master k8s://https://myK8SMaster:6443 \
> --deploy-mode cluster \
> --name hello-spark \
> --class Hello \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=gradiant/spark:2.4.4
> hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar
>
> Then the driver and the 2 executors are created in K8S.
>
> But it fails when I look at the logs of the driver, I see this :
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at Hello$.main(hello.scala:24)
> at Hello.main(hello.scala)
>
>
> As you can see , my application jar helloSpark.jar file is correctly
> loaded on HDFS by the Spark submit, but writing to HDFS fails.
>
> I have also tried to add the hadoop client and hdfs dependencies in the
> spark submit command:
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Oh thank you, you're right!! I feel shameful


From: German Schiavon 
Sent: Wednesday, December 16, 2020 18:01
To: Loic DESCOTTE 
Cc: user@spark.apache.org 
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

  
data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:27
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:20
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme:
hfds

  data.write.mode("overwrite").format("text").save("hfds://
hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE <
loic.desco...@kaizen-solutions.net> wrote:

> So I've tried several other things, including building a fat jar with hdfs
> dependency inside my app jar, and added this to the Spark configuration in
> the code :
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark 7")
>   .config("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.
> DistributedFileSystem].getName)
>   .getOrCreate()
>
>
> But still the same error...
>
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:27
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> I think it'll have to be part of the Spark distro, but I'm not 100% sure.
> I also think these get registered via manifest files in the JARs; if some
> process is stripping those when creating a bundled up JAR, could be it.
> Could be that it's failing to initialize too for some reason.
>
> On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> I've tried with this spark-submit option :
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But it didn't solve the issue.
> Should I add more jars?
>
> Thanks
> Loïc
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:20
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Seems like your Spark cluster doesn't somehow have the Hadoop JARs?
>
> On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> Hello,
>
> I am using Spark On Kubernetes and I have the following error when I try
> to write data on HDFS : "no filesystem for scheme hdfs"
>
> More details :
>
> I am submitting my application with Spark submit like this :
>
> spark-submit --master k8s://https://myK8SMaster:6443 \
> --deploy-mode cluster \
> --name hello-spark \
> --class Hello \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=gradiant/spark:2.4.4
> hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar
>
> Then the driver and the 2 executors are created in K8S.
>
> But it fails when I look at the logs of the driver, I see this :
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at Hello$.main(hello.scala:24)
> at Hello.main(hello.scala)
>
>
> As you can see , my application jar helloSpark.jar file is correctly
> loaded on HDFS by the Spark submit, but writing to HDFS fails.
>
> I have also tried to add the hadoop client and hdfs dependencies in the
> spark submit command:
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But the error is still here.
>
>
> Here is the Scala code of my application :
>
>
> import java.util.Calendar
>
> import org.apache.spark.sql.SparkSession
>
> case class Data(singleField: String)
>
> object Hello
> {
> def main(args: Array[String])
> {
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark")
>   .getOrCreate()
>
> import spark.implicits._
>
> val now = Calendar.getInstance().getTime().toString
> val data = List(Data(now)).toDF()
>
> data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
> }
> }
>
> Thanks for your help,
> Loïc
>
>


RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


From: Sean Owen 
Sent: Wednesday, December 16, 2020 14:27
To: Loic DESCOTTE 
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.
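
That manifest point can be checked directly: Hadoop FileSystem implementations
are registered through META-INF/services/org.apache.hadoop.fs.FileSystem inside
the jars, and shade/assembly plugins sometimes drop or overwrite that service
file instead of merging it. A quick sketch, using the jar name from this thread
as a stand-in:

# List the FileSystem implementations registered in the assembled jar
unzip -p helloSpark.jar META-INF/services/org.apache.hadoop.fs.FileSystem

# If the entry is missing, or does not mention
# org.apache.hadoop.hdfs.DistributedFileSystem, the assembly step is the likely
# culprit (service files need to be concatenated, not replaced).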

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:20
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Hello,

I am using Spark on Kubernetes and I get the following error when I try to
write data to HDFS: "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Spark on Kubernetes

2020-11-13 Thread Arti Pande
Hi,

Is it recommended to use Spark on K8S in production?

Spark operator for Kubernetes seems to be in beta state.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator#:~:text=The%20Kubernetes%20Operator%20for%20Apache%20Spark%20aims%20to%20make%20specifying,surfacing%20status%20of%20Spark%20applications
.

Apparently there are open JIRA issues for spark that talk about problems on
K8s with dynamic allocation etc. Are there any workarounds? Are there any
other issues? What is the official recommendation?

Thanks & regards,
Arti Pande


Re: Hive on Spark in Kubernetes.

2020-10-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Thank you very much!


Sent from my iPhone

> On 7 Oct 2020, at 17:38, mykidong  wrote:
> 
> Hi all,
> 
> I have recently written a blog about hive on spark in kubernetes
> environment:
> - https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1
> 
> In this blog, you can find how to run hive on kubernetes using spark thrift
> server compatible with hive server2.
> 
> Cheers,
> 
> - Kidong.
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Hive on Spark in Kubernetes.

2020-10-07 Thread mykidong
Hi all,

I have recently written a blog about Hive on Spark in a Kubernetes
environment:
- https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1

In this blog, you can see how to run Hive on Kubernetes using the Spark Thrift
Server, which is compatible with HiveServer2.

Cheers,

- Kidong.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-12 Thread Prashant Sharma
Driver HA is not yet available in k8s mode. It could be a good area to
work on; I want to take a look at it. I personally refer to the official Spark
documentation for reference.
Thanks,



On Fri, Jul 10, 2020, 9:30 PM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Hi Prashant,
>
>
>
> It sounds encouraging. During scale down of the cluster, probably few of
> the spark jobs are impacted due to re-computation of shuffle data. This is
> not of supreme importance for us for now.
>
> Is there any reference deployment architecture available, which is HA ,
> scalable and dynamic-allocation-enabled for deploying Spark on K8s? Any
> suggested github repo or link?
>
>
>
> Thanks,
>
> Vaibhav V
>
>
>
>
>
> *From:* Prashant Sharma 
> *Sent:* Friday, July 10, 2020 12:57 AM
> *To:* user@spark.apache.org
> *Cc:* Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>; Varshney, Vaibhav (DI SW CAS MP AFC ARC) <
> vaibhav.varsh...@siemens.com>
> *Subject:* Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
>
>
> Hi,
>
>
>
> Whether it is a blocker or not, is upto you to decide. But, spark k8s
> cluster supports dynamic allocation, through a different mechanism, that
> is, without using an external shuffle service.
> https://issues.apache.org/jira/browse/SPARK-27963. There are pros and
> cons of both approaches. The only disadvantage of scaling without external
> shuffle service is, when the cluster scales down or it loses executors due
> to some external cause ( for example losing spot instances), we lose the
> shuffle data (data that was computed as an intermediate to some overall
> computation) on that executor. This situation may not lead to data loss, as
> spark can recompute the lost shuffle data.
>
>
>
> Dynamically, scaling up and down scaling, is helpful when the spark
> cluster is running off, "spot instances on AWS" for example or when the
> size of data is not known in advance. In other words, we cannot estimate
> how much resources would be needed to process the data. Dynamic scaling,
> lets the cluster increase its size only based on the number of pending
> tasks, currently this is the only metric implemented.
>
>
>
> I don't think it is a blocker for my production use cases.
>
>
>
> Thanks,
>
> Prashant
>
>
>
> On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
>
> Thanks for response. We have tried it in dev env. For production, if Spark
> 3.0 is not leveraging k8s scheduler, then would Spark Cluster in K8s be
> "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>
>


RE: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-10 Thread Varshney, Vaibhav
Hi Prashant,

It sounds encouraging. During scale-down of the cluster, a few of the Spark
jobs may be impacted by re-computation of shuffle data. This is not critical
for us for now.
Is there any reference deployment architecture available for deploying Spark
on K8s that is HA, scalable and dynamic-allocation-enabled? Any suggested
GitHub repo or link?

Thanks,
Vaibhav V


From: Prashant Sharma 
Sent: Friday, July 10, 2020 12:57 AM
To: user@spark.apache.org
Cc: Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) 
; Varshney, Vaibhav (DI SW CAS MP AFC ARC) 

Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

Hi,

Whether it is a blocker or not is up to you to decide. But a Spark K8s cluster
supports dynamic allocation through a different mechanism, that is, without
using an external shuffle service:
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons of
both approaches. The only disadvantage of scaling without an external shuffle
service is that when the cluster scales down, or loses executors due to some
external cause (for example losing spot instances), we lose the shuffle data
(intermediate data computed as part of some overall computation) on those
executors. This may not lead to data loss, as Spark can recompute the lost
shuffle data.

Dynamic scaling, both up and down, is helpful when the Spark cluster is
running on spot instances on AWS, for example, or when the size of the data is
not known in advance. In other words, we cannot estimate how many resources
would be needed to process the data. Dynamic scaling lets the cluster increase
its size based only on the number of pending tasks; currently this is the only
metric implemented.

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav
<vaibhav.varsh...@siemens.com> wrote:
Thanks for the response. We have tried it in a dev environment. For production,
if Spark 3.0 is not leveraging the K8s scheduler, would the Spark cluster in
K8s be "static"?
As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is still
a blocker for production workloads?

Thanks,
Vaibhav V

-Original Message-
From: Sean Owen <sro...@gmail.com>
Sent: Thursday, July 9, 2020 3:20 PM
To: Varshney, Vaibhav (DI SW CAS MP AFC ARC) <vaibhav.varsh...@siemens.com>
Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC)
<sai.ram...@siemens.com>
Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

I haven't used the K8S scheduler personally, but, just based on that comment I 
wouldn't worry too much. It's been around for several versions and AFAIK works 
fine in general. We sometimes aren't so great about removing "experimental" 
labels. That said I know there are still some things that could be added to it 
and more work going on, and maybe people closer to that work can comment. But 
yeah you shouldn't be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav 
mailto:vaibhav.varsh...@siemens.com>> wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Prashant Sharma
Hi,

Whether it is a blocker or not is up to you to decide. But a Spark K8s
cluster supports dynamic allocation through a different mechanism, that is,
without using an external shuffle service:
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons
of both approaches. The only disadvantage of scaling without an external
shuffle service is that when the cluster scales down, or loses executors due
to some external cause (for example losing spot instances), we lose the
shuffle data (intermediate data computed as part of some overall
computation) on those executors. This may not lead to data loss, as
Spark can recompute the lost shuffle data.

Dynamic scaling, both up and down, is helpful when the Spark cluster is
running on spot instances on AWS, for example, or when the size of the data
is not known in advance. In other words, we cannot estimate how many
resources would be needed to process the data. Dynamic scaling lets the
cluster increase its size based only on the number of pending tasks;
currently this is the only metric implemented.
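
For reference, here is a minimal sketch of the flags involved when submitting
against K8s with this mechanism enabled (assuming Spark 3.0+; the API-server
address, namespace, service account and image name are placeholders, the
min/max values are illustrative rather than recommendations, and the example
jar path depends on how your image was built):

  # Dynamic allocation on K8s without an external shuffle service (SPARK-27963).
  # Placeholders: <k8s-apiserver-host>, <port>, <your-spark-image>.
  spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<port> \
    --deploy-mode cluster \
    --name dynamic-allocation-demo \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=10 \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar 1000

With spark.dynamicAllocation.shuffleTracking.enabled=true, the driver tracks
which executors still hold shuffle data needed by active jobs and avoids
removing them, which is what replaces the external shuffle service here.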

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Thanks for the response. We have tried it in a dev environment. For
> production, if Spark 3.0 is not leveraging the K8s scheduler, would the
> Spark cluster in K8s be "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still a blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>


RE: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Varshney, Vaibhav
Thanks for the response. We have tried it in a dev environment. For production,
if Spark 3.0 is not leveraging the K8s scheduler, would the Spark cluster in
K8s be "static"?
As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is still
a blocker for production workloads?

Thanks,
Vaibhav V

-Original Message-
From: Sean Owen  
Sent: Thursday, July 9, 2020 3:20 PM
To: Varshney, Vaibhav (DI SW CAS MP AFC ARC) 
Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) 

Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

I haven't used the K8S scheduler personally, but, just based on that comment I 
wouldn't worry too much. It's been around for several versions and AFAIK works 
fine in general. We sometimes aren't so great about removing "experimental" 
labels. That said I know there are still some things that could be added to it 
and more work going on, and maybe people closer to that work can comment. But 
yeah you shouldn't be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav  
wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Sean Owen
I haven't used the K8S scheduler personally, but, just based on that
comment I wouldn't worry too much. It's been around for several
versions and AFAIK works fine in general. We sometimes aren't so great
about removing "experimental" labels. That said I know there are still
some things that could be added to it and more work going on, and
maybe people closer to that work can comment. But yeah you shouldn't
be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav
 wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Varshney, Vaibhav
Hi Spark Experts,

We are trying to deploy Spark on Kubernetes.
As per the doc http://spark.apache.org/docs/latest/running-on-kubernetes.html,
it looks like K8s deployment is experimental:
"The Kubernetes scheduler is currently experimental".

Does Spark 3.0 not support production deployment using the K8s scheduler?
What's the plan for full support of the K8s scheduler?

Thanks,
Vaibhav V
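
For context, the basic cluster-mode submission described on that page looks
roughly like the following. This is only a sketch: the API-server address,
namespace, service account, image name and application jar are placeholders
that depend on your cluster and image.

  # A plain, statically-sized submission: 2 executors of 2g each.
  # The service account typically needs RBAC permission to create and delete
  # pods, since the driver itself requests executor pods from the API server.
  spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.executor.memory=2g \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    local:///path/to/your-application.jar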

