Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi,
In k8s the driver is responsible for executor creation. The likely cause of
your problem is insufficient memory allocated for executors in the K8s
cluster. Even with dynamic allocation, k8s won't schedule executor pods if
there is not enough free memory to fulfill their resource requests.

My suggestions

   - Increase Executor Memory: Allocate more memory per executor (e.g., 2GB
   or 3GB) to allow for multiple executors within available cluster memory.
   - Adjust Driver Pod Resources: Ensure the driver pod has enough memory
   to run Spark and manage executors.
   - Optimize Resource Management: Explore on-demand allocation or adjust
   allocation granularity for better resource utilization. For example, look
   at the documentation for executor on-demand allocation
   (spark.executor.cores=0) and for spark.dynamicAllocation.minExecutors and
   spark.dynamicAllocation.maxExecutors (see the sketch after this list).
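
As a rough illustration only (the image name, namespace, jar path and the
memory/executor numbers below are placeholders, and the right values depend
on what your cluster nodes can actually offer), the relevant spark-submit
settings would look something like this:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name my-app \
   --conf spark.kubernetes.namespace=my-namespace \
   --conf spark.kubernetes.container.image=my-spark-image:latest \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.dynamicAllocation.minExecutors=2 \
   --conf spark.dynamicAllocation.maxExecutors=6 \
   --conf spark.driver.memory=2g \
   --conf spark.driver.cores=2 \
   --conf spark.executor.memory=2g \
   --conf spark.executor.cores=2 \
   local:///opt/spark/jars/my-app.jar

If the executor pods still never appear, a kubectl describe on the pending
pods (and the driver log) will normally tell you whether the scheduler
rejected them for lack of memory or CPU.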

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 30 Apr 2024 at 04:29, Tarun raghav 
wrote:

> Respected Sir/Madam,
> I am Tarunraghav. I have a query regarding spark on kubernetes.
>
> We have an eks cluster, within which we have spark installed in the pods.
> We set the executor memory as 1GB and set the executor instances as 2, I
> have also set dynamic allocation as true. So when I try to read a 3 GB CSV
> file or parquet file, it is supposed to increase the number of pods by 2.
> But the number of executor pods is zero.
> I don't know why executor pods aren't being created, even though I set
> executor instance as 2. Please suggest a solution for this.
>
> Thanks & Regards,
> Tarunraghav
>
>


Spark on Kubernetes

2024-04-29 Thread Tarun raghav
Respected Sir/Madam,
I am Tarunraghav. I have a query regarding spark on kubernetes.

We have an eks cluster, within which we have spark installed in the pods.
We set the executor memory as 1GB and set the executor instances as 2, I
have also set dynamic allocation as true. So when I try to read a 3 GB CSV
file or parquet file, it is supposed to increase the number of pods by 2.
But the number of executor pods is zero.
I don't know why executor pods aren't being created, even though I set
executor instance as 2. Please suggest a solution for this.

Thanks & Regards,
Tarunraghav


Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words Sri

Well, it is true that as yet Spark on Kubernetes is not on a par with Spark
on YARN in maturity, and essentially Spark on Kubernetes is still a work in
progress. *So in the first place, IMO, one needs to think about why the
executors are failing. What causes this behaviour? Is it the code or some
inadequate set-up?* These things come to my mind:


   - Resource Allocation: Insufficient resources (CPU, memory) can lead to
   executor failures.
   - Mis-configuration Issues: Verify that the configurations are
   appropriate for your workload.
   - External Dependencies: If your Spark job relies on external services
   or data sources, ensure they are accessible. Issues such as network
   problems or unavailability of external services can lead to executor
   failures.
   - Data Skew: Uneven distribution of data across partitions can lead to
   data skew and cause some executors to process significantly more data than
   others. This can lead to resource exhaustion on specific executors.
   - Spark Version and Kubernetes Compatibility: Whether Spark is running on
   EKS or GKE, make sure you are using a Spark version that is compatible
   with your Kubernetes environment. These vendors normally run older, more
   stable versions of Spark, and compatibility issues can arise when using
   your own, newer version of Spark.
   - Docker images: How up to date are your docker images on the container
   registries (ECR, GCR)? Is there any incompatibility between docker images
   built on one Spark version and the host Spark version you are submitting
   your spark-submit from? (A quick way to check what the failing pods
   themselves report is sketched after this list.)
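
As a quick, low-effort check (the namespace and pod name below are
placeholders), Kubernetes itself usually tells you why an executor pod
failed or never started:

# List the pods of the Spark application
kubectl get pods -n spark-namespace
# The Events section at the bottom shows scheduling, image-pull or OOM issues
kubectl describe pod <failed-executor-pod> -n spark-namespace
# Executor stdout/stderr for application-level errors
kubectl logs <failed-executor-pod> -n spark-namespace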

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 23:18, Sri Potluri  wrote:

> Dear Mich,
>
> Thank you for your detailed response and the suggested approach to
> handling retry logic. I appreciate you taking the time to outline the
> method of embedding custom retry mechanisms directly into the application
> code.
>
> While the solution of wrapping the main logic of the Spark job in a loop
> for controlling the number of retries is technically sound and offers a
> workaround, it may not be the most efficient or maintainable solution for
> organizations running a large number of Spark jobs. Modifying each
> application to include custom retry logic can be a significant undertaking,
> introducing variability in how retries are handled across different jobs,
> and require additional testing and maintenance.
>
> Ideally, operational concerns like retry behavior in response to
> infrastructure failures should be decoupled from the business logic of
> Spark applications. This separation allows data engineers and scientists to
> focus on the application logic without needing to implement and test
> infrastructure resilience mechanisms.
>
> Thank you again for your time and assistance.
>
> Best regards,
> Sri Potluri
>
> On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh 
> wrote:
>
>> Went through your issue with the code running on k8s
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> Well fault tolerance is built especially in k8s cluster. You can
>> implement your own logic to control the retry attempts. You can do this
>> by wrapping the main logic of your Spark job in a loop and controlling the
>> number of retries. If a persistent issue is detected, you can choose to
>> stop the job. Today is the third time that looping control has come up :)
>>
>> Take this code
>>
>> import time
>> max_retries = 5 retries = 0 while retries < max_retries: try: # Your
>> Spark job logic here except Exception as e: # Log the exception
>> print(f"Exception in Spark job:

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported the window-based executor failure-tracking mechanism for
YARN for a long time; SPARK-41210 [1][2] (included in 3.5.0) extended this
feature to K8s.

[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732
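
A rough sketch of how that could be enabled on 3.5.0+ is below; the config
names are from memory and the image, app name and threshold values are
placeholders, so please verify them against the 3.5 configuration docs
before relying on them:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name my-app \
   --conf spark.kubernetes.container.image=my-spark-image:latest \
   --conf spark.executor.maxNumFailures=5 \
   --conf spark.executor.failuresValidityInterval=10m \
   local:///opt/spark/jars/my-app.jar

With settings along those lines the application should give up once more
than the allowed number of executors fail within the validity window,
instead of recreating executors indefinitely.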

Thanks,
Cheng Pan


> On Feb 19, 2024, at 23:59, Sri Potluri  wrote:
> 
> Hello Spark Community,
> 
> I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, 
> for running various Spark applications. While the system generally works 
> well, I've encountered a challenge related to how Spark applications handle 
> executor failures, specifically in scenarios where executors enter an error 
> state due to persistent issues.
> 
> Problem Description
> 
> When an executor of a Spark application fails, the system attempts to 
> maintain the desired level of parallelism by automatically recreating a new 
> executor to replace the failed one. While this behavior is beneficial for 
> transient errors, ensuring that the application continues to run, it becomes 
> problematic in cases where the failure is due to a persistent issue (such as 
> misconfiguration, inaccessible external resources, or incompatible 
> environment settings). In such scenarios, the application enters a loop, 
> continuously trying to recreate executors, which leads to resource wastage 
> and complicates application management.
> 
> Desired Behavior
> 
> Ideally, I would like to have a mechanism to limit the number of retries for 
> executor recreation. If the system fails to successfully create an executor 
> more than a specified number of times (e.g., 5 attempts), the entire Spark 
> application should fail and stop trying to recreate the executor. This 
> behavior would help in efficiently managing resources and avoiding prolonged 
> failure states.
> 
> Questions for the Community
> 
> 1. Is there an existing configuration or method within Spark or the Spark 
> Operator to limit executor recreation attempts and fail the job after 
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or 
> solutions that could be applied in this context?
> 
> 
> Additional Context
> 
> I have explored Spark's task and stage retry configurations 
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these 
> do not directly address the issue of limiting executor creation retries. 
> Implementing a custom monitoring solution to track executor failures and 
> manually stop the application is a potential workaround, but it would be 
> preferable to have a more integrated solution.
> 
> I appreciate any guidance, insights, or feedback you can provide on this 
> matter.
> 
> Thank you for your time and support.
> 
> Best regards,
> Sri P





Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich,

Thank you for your detailed response and the suggested approach to handling
retry logic. I appreciate you taking the time to outline the method of
embedding custom retry mechanisms directly into the application code.

While the solution of wrapping the main logic of the Spark job in a loop
for controlling the number of retries is technically sound and offers a
workaround, it may not be the most efficient or maintainable solution for
organizations running a large number of Spark jobs. Modifying each
application to include custom retry logic can be a significant undertaking,
introducing variability in how retries are handled across different jobs,
and requiring additional testing and maintenance.

Ideally, operational concerns like retry behavior in response to
infrastructure failures should be decoupled from the business logic of
Spark applications. This separation allows data engineers and scientists to
focus on the application logic without needing to implement and test
infrastructure resilience mechanisms.

Thank you again for your time and assistance.

Best regards,
Sri Potluri

On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh 
wrote:

> Went through your issue with the code running on k8s
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it
> becomes problematic in cases where the failure is due to a persistent issue
> (such as misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> Well, fault tolerance is built in, especially in a k8s cluster. You can
> implement your
> own logic to control the retry attempts. You can do this by wrapping the
> main logic of your Spark job in a loop and controlling the number of
> retries. If a persistent issue is detected, you can choose to stop the job.
> Today is the third time that looping control has come up :)
>
> Take this code
>
> import time
> max_retries = 5
> retries = 0
> while retries < max_retries:
>     try:
>         pass  # Your Spark job logic here
>     except Exception as e:
>         print(f"Exception in Spark job: {str(e)}")  # Log the exception
>         retries += 1  # Increment the retry count
>         time.sleep(60)  # Sleep before retrying
>     else:
>         break  # Break out of the loop if the job completes successfully
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh 
> wrote:
>
>> Not that I am aware of any configuration parameter in Spark classic to
>> limit executor creation. Because of fault tolerance Spark will try to
>> recreate failed executors. Not really that familiar with the Spark operator
>> for k8s. There may be something there.
>>
>> Have you considered custom monitoring and handling within Spark itself
>> using max_retries = 5  etc?
>>
>> HTH
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:
>>
>>> Hello Spark Community,
>>>
>>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>>> Operator, for running various

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s

When an executor of a Spark application fails, the system attempts to
maintain the desired level of parallelism by automatically recreating a new
executor to replace the failed one. While this behavior is beneficial for
transient errors, ensuring that the application continues to run, it
becomes problematic in cases where the failure is due to a persistent issue
(such as misconfiguration, inaccessible external resources, or incompatible
environment settings). In such scenarios, the application enters a loop,
continuously trying to recreate executors, which leads to resource wastage
and complicates application management.

Well, fault tolerance is built in, especially in a k8s cluster. You can implement your
own logic to control the retry attempts. You can do this by wrapping the
main logic of your Spark job in a loop and controlling the number of
retries. If a persistent issue is detected, you can choose to stop the job.
Today is the third time that looping control has come up :)

Take this code

import time
max_retries = 5
retries = 0
while retries < max_retries:
    try:
        pass  # Your Spark job logic here
    except Exception as e:
        print(f"Exception in Spark job: {str(e)}")  # Log the exception
        retries += 1  # Increment the retry count
        time.sleep(60)  # Sleep before retrying
    else:
        break  # Break out of the loop if the job completes successfully

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh 
wrote:

> Not that I am aware of any configuration parameter in Spark classic to
> limit executor creation. Because of fault tolerance Spark will try to
> recreate failed executors. Not really that familiar with the Spark operator
> for k8s. There may be something there.
>
> Have you considered custom monitoring and handling within Spark itself
> using max_retries = 5  etc?
>
> HTH
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:
>
>> Hello Spark Community,
>>
>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>> Operator, for running various Spark applications. While the system
>> generally works well, I've encountered a challenge related to how Spark
>> applications handle executor failures, specifically in scenarios where
>> executors enter an error state due to persistent issues.
>>
>> *Problem Description*
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> *Desired Behavior*
>>
>> Ideally, I would like to have a mechanism to limit the number of retries
>> for executor recreation. If the system fails to successfully create an
>> executor more than a specified number of times (e.g., 5 attempts), the
>> entire Spark application should fail and stop trying to recreate the
>> executor. This behavior would help in efficiently managing resources and
>> avoiding prolonged failure states.
>>
>> *Questions for the Community*
>>
>> 1. 

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
I am not aware of any configuration parameter in classic Spark to limit
executor creation. Because of fault tolerance, Spark will try to recreate
failed executors. I am not really that familiar with the Spark operator for
k8s; there may be something there.

Have you considered custom monitoring and handling within Spark itself
using max_retries = 5  etc?

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:

> Hello Spark Community,
>
> I am currently leveraging Spark on Kubernetes, managed by the Spark
> Operator, for running various Spark applications. While the system
> generally works well, I've encountered a challenge related to how Spark
> applications handle executor failures, specifically in scenarios where
> executors enter an error state due to persistent issues.
>
> *Problem Description*
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it
> becomes problematic in cases where the failure is due to a persistent issue
> (such as misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> *Desired Behavior*
>
> Ideally, I would like to have a mechanism to limit the number of retries
> for executor recreation. If the system fails to successfully create an
> executor more than a specified number of times (e.g., 5 attempts), the
> entire Spark application should fail and stop trying to recreate the
> executor. This behavior would help in efficiently managing resources and
> avoiding prolonged failure states.
>
> *Questions for the Community*
>
> 1. Is there an existing configuration or method within Spark or the Spark
> Operator to limit executor recreation attempts and fail the job after
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or
> solutions that could be applied in this context?
>
>
> *Additional Context*
>
> I have explored Spark's task and stage retry configurations
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
> do not directly address the issue of limiting executor creation retries.
> Implementing a custom monitoring solution to track executor failures and
> manually stop the application is a potential workaround, but it would be
> preferable to have a more integrated solution.
>
> I appreciate any guidance, insights, or feedback you can provide on this
> matter.
>
> Thank you for your time and support.
>
> Best regards,
> Sri P
>


[Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Hello Spark Community,

I am currently leveraging Spark on Kubernetes, managed by the Spark
Operator, for running various Spark applications. While the system
generally works well, I've encountered a challenge related to how Spark
applications handle executor failures, specifically in scenarios where
executors enter an error state due to persistent issues.

*Problem Description*

When an executor of a Spark application fails, the system attempts to
maintain the desired level of parallelism by automatically recreating a new
executor to replace the failed one. While this behavior is beneficial for
transient errors, ensuring that the application continues to run, it
becomes problematic in cases where the failure is due to a persistent issue
(such as misconfiguration, inaccessible external resources, or incompatible
environment settings). In such scenarios, the application enters a loop,
continuously trying to recreate executors, which leads to resource wastage
and complicates application management.

*Desired Behavior*

Ideally, I would like to have a mechanism to limit the number of retries
for executor recreation. If the system fails to successfully create an
executor more than a specified number of times (e.g., 5 attempts), the
entire Spark application should fail and stop trying to recreate the
executor. This behavior would help in efficiently managing resources and
avoiding prolonged failure states.

*Questions for the Community*

1. Is there an existing configuration or method within Spark or the Spark
Operator to limit executor recreation attempts and fail the job after
reaching a threshold?

2. Has anyone else encountered similar challenges and found workarounds or
solutions that could be applied in this context?


*Additional Context*

I have explored Spark's task and stage retry configurations
(`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
do not directly address the issue of limiting executor creation retries.
Implementing a custom monitoring solution to track executor failures and
manually stop the application is a potential workaround, but it would be
preferable to have a more integrated solution.

I appreciate any guidance, insights, or feedback you can provide on this
matter.

Thank you for your time and support.

Best regards,
Sri P


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Yes, I have gone through it. So please give me the setup. More context: my
jar file is written in Java.

On Mon, Feb 19, 2024, 8:53 PM Mich Talebzadeh 
wrote:

> Sure, but first it would be beneficial to understand the way Spark works on
> Kubernetes and the concepts.
>
> Have a look at this article of mine
>
> Spark on Kubernetes, A Practitioner’s Guide
> <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-%3FtrackingId=Wsu3lkoPaCWqGemYHe8%252BLQ%253D%253D/?trackingId=Wsu3lkoPaCWqGemYHe8%2BLQ%3D%3D>
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 19 Feb 2024 at 15:09, Jagannath Majhi <
> jagannath.ma...@cloud.cbnits.com> wrote:
>
>> Yes
>>
>> On Mon, Feb 19, 2024, 8:35 PM Mich Talebzadeh 
>> wrote:
>>
>>> OK you have a jar file that you want to work with when running using
>>> Spark on k8s as the execution engine (EKS) as opposed to  YARN on EMR as
>>> the execution engine?
>>>
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>
>>>
>>> On Mon, 19 Feb 2024 at 14:38, Jagannath Majhi <
>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>
>>>> I am not using any private docker image. Only I am running the jar file
>>>> in EMR using spark-submit command so now I want to run this jar file in eks
>>>> so can you please tell me how can I set-up for this ??
>>>>
>>>> On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
>>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>>
>>>>> Can we connect over Google meet??
>>>>>
>>>>> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Where is your docker file? In ECR container registry.
>>>>>> If you are going to use EKS, then it need to be accessible to all
>>>>>> nodes of cluster
>>>>>>
>>>>>> When you build your docker image, put your jar under the $SPARK_HOME
>>>>>> directory. Then add a line to your docker build file as below
>>>>>> Here I am accessing Google BigQuery DW from EKS
>>>>>> # Add a BigQuery connector jar.
>>>>>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>>>>>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>>>>>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>>>>>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>>>>>> COPY --chown=spark:spark \
>>>>>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>>>>>> "${SPARK_EXTRA_JARS_DIR}"
>>>>>>
>>>>>> Here I am accessing Google BigQuery DW from EKS cluster
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
I am not using any private docker image. I am only running the jar file in
EMR using the spark-submit command. Now I want to run this jar file in EKS,
so can you please tell me how I can set this up?

On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> Can we connect over Google meet??
>
> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh 
> wrote:
>
>> Where is your docker file? In ECR container registry.
>> If you are going to use EKS, then it need to be accessible to all nodes
>> of cluster
>>
>> When you build your docker image, put your jar under the $SPARK_HOME
>> directory. Then add a line to your docker build file as below
>> Here I am accessing Google BigQuery DW from EKS
>> # Add a BigQuery connector jar.
>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>> COPY --chown=spark:spark \
>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>> "${SPARK_EXTRA_JARS_DIR}"
>>
>> Here I am accessing Google BigQuery DW from EKS cluster
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 19 Feb 2024 at 13:42, Jagannath Majhi <
>> jagannath.ma...@cloud.cbnits.com> wrote:
>>
>>> Dear Spark Community,
>>>
>>> I hope this email finds you well. I am reaching out to seek assistance
>>> and guidance regarding a task I'm currently working on involving Apache
>>> Spark.
>>>
>>> I have developed a JAR file that contains some Spark applications and
>>> functionality, and I need to run this JAR file within a Spark cluster.
>>> However, the JAR file is located in an AWS S3 bucket. I'm facing some
>>> challenges in configuring Spark to access and execute this JAR file
>>> directly from the S3 bucket.
>>>
>>> I would greatly appreciate any advice, best practices, or pointers on
>>> how to achieve this integration effectively. Specifically, I'm looking for
>>> insights on:
>>>
>>>1. Configuring Spark to access and retrieve the JAR file from an AWS
>>>S3 bucket.
>>>2. Setting up the necessary permissions and authentication
>>>mechanisms to ensure seamless access to the S3 bucket.
>>>3. Any potential performance considerations or optimizations when
>>>running Spark applications with dependencies stored in remote storage 
>>> like
>>>AWS S3.
>>>
>>> If anyone in the community has prior experience or knowledge in this
>>> area, I would be extremely grateful for your guidance. Additionally, if
>>> there are any relevant resources, documentation, or tutorials that you
>>> could recommend, it would be incredibly helpful.
>>>
>>> Thank you very much for considering my request. I look forward to
>>> hearing from you and benefiting from the collective expertise of the Spark
>>> community.
>>>
>>> Best regards, Jagannath Majhi
>>>
>>


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Sure, but first it would be beneficial to understand the way Spark works on
Kubernetes and the concepts.

Have a look at this article of mine

Spark on Kubernetes, A Practitioner’s Guide
<https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-%3FtrackingId=Wsu3lkoPaCWqGemYHe8%252BLQ%253D%253D/?trackingId=Wsu3lkoPaCWqGemYHe8%2BLQ%3D%3D>

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 19 Feb 2024 at 15:09, Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> Yes
>
> On Mon, Feb 19, 2024, 8:35 PM Mich Talebzadeh 
> wrote:
>
>> OK you have a jar file that you want to work with when running using
>> Spark on k8s as the execution engine (EKS) as opposed to  YARN on EMR as
>> the execution engine?
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 19 Feb 2024 at 14:38, Jagannath Majhi <
>> jagannath.ma...@cloud.cbnits.com> wrote:
>>
>>> I am not using any private docker image. Only I am running the jar file
>>> in EMR using spark-submit command so now I want to run this jar file in eks
>>> so can you please tell me how can I set-up for this ??
>>>
>>> On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>
>>>> Can we connect over Google meet??
>>>>
>>>> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Where is your docker file? In ECR container registry.
>>>>> If you are going to use EKS, then it need to be accessible to all
>>>>> nodes of cluster
>>>>>
>>>>> When you build your docker image, put your jar under the $SPARK_HOME
>>>>> directory. Then add a line to your docker build file as below
>>>>> Here I am accessing Google BigQuery DW from EKS
>>>>> # Add a BigQuery connector jar.
>>>>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>>>>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>>>>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>>>>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>>>>> COPY --chown=spark:spark \
>>>>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>>>>> "${SPARK_EXTRA_JARS_DIR}"
>>>>>
>>>>> Here I am accessing Google BigQuery DW from EKS cluster
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Dad | Technologist | Solutions Architect | Engineer
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>> expert opinions (Werner
>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>

Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
OK you have a jar file that you want to work with when running using Spark
on k8s as the execution engine (EKS) as opposed to  YARN on EMR as the
execution engine?


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 19 Feb 2024 at 14:38, Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> I am not using any private docker image. Only I am running the jar file in
> EMR using spark-submit command so now I want to run this jar file in eks so
> can you please tell me how can I set-up for this ??
>
> On Mon, Feb 19, 2024, 8:06 PM Jagannath Majhi <
> jagannath.ma...@cloud.cbnits.com> wrote:
>
>> Can we connect over Google meet??
>>
>> On Mon, Feb 19, 2024, 8:03 PM Mich Talebzadeh 
>> wrote:
>>
>>> Where is your docker file? In ECR container registry.
>>> If you are going to use EKS, then it need to be accessible to all nodes
>>> of cluster
>>>
>>> When you build your docker image, put your jar under the $SPARK_HOME
>>> directory. Then add a line to your docker build file as below
>>> Here I am accessing Google BigQuery DW from EKS
>>> # Add a BigQuery connector jar.
>>> ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
>>> ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
>>> RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
>>> && chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
>>> COPY --chown=spark:spark \
>>> spark-bigquery-with-dependencies_2.12-0.22.2.jar
>>> "${SPARK_EXTRA_JARS_DIR}"
>>>
>>> Here I am accessing Google BigQuery DW from EKS cluster
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Mon, 19 Feb 2024 at 13:42, Jagannath Majhi <
>>> jagannath.ma...@cloud.cbnits.com> wrote:
>>>
 Dear Spark Community,

 I hope this email finds you well. I am reaching out to seek assistance
 and guidance regarding a task I'm currently working on involving Apache
 Spark.

 I have developed a JAR file that contains some Spark applications and
 functionality, and I need to run this JAR file within a Spark cluster.
 However, the JAR file is located in an AWS S3 bucket. I'm facing some
 challenges in configuring Spark to access and execute this JAR file
 directly from the S3 bucket.

 I would greatly appreciate any advice, best practices, or pointers on
 how to achieve this integration effectively. Specifically, I'm looking for
 insights on:

1. Configuring Spark to access and retrieve the JAR file from an
AWS S3 bucket.
2. Setting up the necessary permissions and authentication
mechanisms to ensure seamless access to the S3 bucket.
3. Any potential performance considerations or optimizations when
running Spark applications with dependencies stored in remote storage 
 like
AWS S3.

 If anyone in the community has prior experience or knowledge in this
 area, I would be extremely grateful for your guidance. Additionally, if
 there are any relevant resources, documentation, or tutorials that you
 could recommend, it would be incredibly helpful.

 Thank you very much for considering my request. I look forward to
 hearing from you and benefiting from the collective expertise of the Spark
 community.

 Best regards, Jagannath Majhi

>>>


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Mich Talebzadeh
Where is your docker file? Is it in the ECR container registry?
If you are going to use EKS, then it needs to be accessible to all nodes of
the cluster.

When you build your docker image, put your jar under the $SPARK_HOME
directory. Then add lines to your docker build file as below.
Here I am accessing Google BigQuery DW from EKS
# Add a BigQuery connector jar.
ENV SPARK_EXTRA_JARS_DIR=/opt/spark/jars/
ENV SPARK_EXTRA_CLASSPATH='/opt/spark/jars/*'
RUN mkdir -p "${SPARK_EXTRA_JARS_DIR}" \
&& chown spark:spark "${SPARK_EXTRA_JARS_DIR}"
COPY --chown=spark:spark \
spark-bigquery-with-dependencies_2.12-0.22.2.jar
"${SPARK_EXTRA_JARS_DIR}"

Here I am accessing Google BigQuery DW from EKS cluster
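
Once the image is pushed to a registry that the EKS nodes can pull from (ECR
in your case), a sketch of the corresponding spark-submit is below. The
registry path, namespace, service account, main class and jar name are
placeholders; the jar is referenced with the local:// scheme because it is
already baked into the image:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name my-java-app \
   --class com.example.Main \
   --conf spark.kubernetes.namespace=spark \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
   --conf spark.kubernetes.container.image=<account-id>.dkr.ecr.<region>.amazonaws.com/my-spark-image:latest \
   --conf spark.executor.instances=2 \
   local:///opt/spark/jars/my-java-app.jar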

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 19 Feb 2024 at 13:42, Jagannath Majhi <
jagannath.ma...@cloud.cbnits.com> wrote:

> Dear Spark Community,
>
> I hope this email finds you well. I am reaching out to seek assistance and
> guidance regarding a task I'm currently working on involving Apache Spark.
>
> I have developed a JAR file that contains some Spark applications and
> functionality, and I need to run this JAR file within a Spark cluster.
> However, the JAR file is located in an AWS S3 bucket. I'm facing some
> challenges in configuring Spark to access and execute this JAR file
> directly from the S3 bucket.
>
> I would greatly appreciate any advice, best practices, or pointers on how
> to achieve this integration effectively. Specifically, I'm looking for
> insights on:
>
>1. Configuring Spark to access and retrieve the JAR file from an AWS
>S3 bucket.
>2. Setting up the necessary permissions and authentication mechanisms
>to ensure seamless access to the S3 bucket.
>3. Any potential performance considerations or optimizations when
>running Spark applications with dependencies stored in remote storage like
>AWS S3.
>
> If anyone in the community has prior experience or knowledge in this area,
> I would be extremely grateful for your guidance. Additionally, if there are
> any relevant resources, documentation, or tutorials that you could
> recommend, it would be incredibly helpful.
>
> Thank you very much for considering my request. I look forward to hearing
> from you and benefiting from the collective expertise of the Spark
> community.
>
> Best regards, Jagannath Majhi
>


Re: Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Richard Smith

I run my Spark jobs in GCP with Google Dataproc using GCS buckets.

I've not used AWS, but its EMR product offers similar functionality to 
Dataproc. The title of your post implies your Spark cluster runs on EKS. 
You might be better off using EMR, see links below:


EMR 
https://medium.com/big-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16


EKS https://medium.com/@vikas.navlani/running-spark-on-aws-eks-1cd4c31786c

Richard

On 19/02/2024 13:36, Jagannath Majhi wrote:


Dear Spark Community,

I hope this email finds you well. I am reaching out to seek assistance 
and guidance regarding a task I'm currently working on involving 
Apache Spark.


I have developed a JAR file that contains some Spark applications and 
functionality, and I need to run this JAR file within a Spark cluster. 
However, the JAR file is located in an AWS S3 bucket. I'm facing some 
challenges in configuring Spark to access and execute this JAR file 
directly from the S3 bucket.


I would greatly appreciate any advice, best practices, or pointers on 
how to achieve this integration effectively. Specifically, I'm looking 
for insights on:


 1. Configuring Spark to access and retrieve the JAR file from an AWS
S3 bucket.
 2. Setting up the necessary permissions and authentication mechanisms
to ensure seamless access to the S3 bucket.
 3. Any potential performance considerations or optimizations when
running Spark applications with dependencies stored in remote
storage like AWS S3.

If anyone in the community has prior experience or knowledge in this 
area, I would be extremely grateful for your guidance. Additionally, 
if there are any relevant resources, documentation, or tutorials that 
you could recommend, it would be incredibly helpful.


Thank you very much for considering my request. I look forward to 
hearing from you and benefiting from the collective expertise of the 
Spark community.


Best regards, Jagannath Majhi


Regarding Spark on Kubernetes(EKS)

2024-02-19 Thread Jagannath Majhi
Dear Spark Community,

I hope this email finds you well. I am reaching out to seek assistance and
guidance regarding a task I'm currently working on involving Apache Spark.

I have developed a JAR file that contains some Spark applications and
functionality, and I need to run this JAR file within a Spark cluster.
However, the JAR file is located in an AWS S3 bucket. I'm facing some
challenges in configuring Spark to access and execute this JAR file
directly from the S3 bucket.

I would greatly appreciate any advice, best practices, or pointers on how
to achieve this integration effectively. Specifically, I'm looking for
insights on:

   1. Configuring Spark to access and retrieve the JAR file from an AWS S3
   bucket.
   2. Setting up the necessary permissions and authentication mechanisms to
   ensure seamless access to the S3 bucket.
   3. Any potential performance considerations or optimizations when
   running Spark applications with dependencies stored in remote storage like
   AWS S3.

If anyone in the community has prior experience or knowledge in this area,
I would be extremely grateful for your guidance. Additionally, if there are
any relevant resources, documentation, or tutorials that you could
recommend, it would be incredibly helpful.

Thank you very much for considering my request. I look forward to hearing
from you and benefiting from the collective expertise of the Spark
community.

Best regards, Jagannath Majhi


Elasticity and scalability for Spark in Kubernetes

2023-10-30 Thread Mich Talebzadeh
I was thinking along the lines of elasticity and autoscaling for Spark in
the context of Kubernetes. My experience with Kubernetes and Spark on the
so-called autopilot has not been that great. This is mainly because in
autopilot you let the choice of nodes be decided by the vendor's default
configuration. Autopilot assumes that you can scale horizontally if resource
allocation is not there. However, this does not take into account that if
you start a k8s node of 4GB, which is totally inadequate for a Spark job
with moderate loads, the driver pod simply fails to create and autopilot
starts building the cluster again, causing delay. Sure, it can start with a
larger node size and it might get there eventually, at a considerable delay.

Vertical elasticity refers to the ability of a single application instance
to scale its resources up or down. This can be done by adjusting the amount
of memory, CPU, or storage allocated to the application.

Horizontal autoscaling refers to the ability to automatically add or remove
application instances based on the workload. This is typically done by
monitoring the application's performance metrics, such as CPU utilization,
memory usage, or request latency.

Vertical elasticity


   - Memory: The amount of memory allocated to each Spark executor.
   - CPU: The number of CPU cores allocated to each Spark executor.
   - Storage: The amount of storage allocated to each Spark executor.


Horizontal autoscaling


   - Minimum number of executors: The minimum number of executors that
   should be running at any given time.
   - Maximum number of executors: The maximum number of executors that can
   be running at any given time.
   - Target CPU utilization: The desired CPU utilization for the cluster.
   - Target memory utilization: The desired memory utilization for the
   cluster.
   - Target request latency: The desired request latency for the
   application.


For example, in Python I would have these:


# Setting the horizontal autoscaling parameters

spark.conf.set('spark.dynamicAllocation.enabled', 'true')
spark.conf.set('spark.dynamicAllocation.minExecutors', min_instances)
spark.conf.set('spark.dynamicAllocation.maxExecutors', max_instances)
spark.conf.set('spark.dynamicAllocation.targetExecutorIdleTime', 30)
spark.conf.set('spark.dynamicAllocation.initialExecutors', 4)
spark.conf.set('spark.dynamicAllocation.targetRequestLatency', 100)

I have also set the following properties, which are not strictly
necessary for horizontal autoscaling, but which can be helpful:

   - target_memory_utilization: This property specifies the desired memory
   utilization for the application cluster.
   - target_request_latency: This property specifies the desired request
   latency for the application cluster.


spark.conf.set('target_request_latency', 100)
spark.conf.set('target_memory_utilization', 60)

Anyway, this is a sample of the parameters that I use in spark-submit:

spark-submit --verbose \
   --properties-file ${property_file} \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name $APPNAME \
   --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
   --conf spark.kubernetes.namespace=$NAMESPACE \
   --conf spark.network.timeout=300 \
   --conf spark.kubernetes.allocation.batch.size=3 \
   --conf spark.kubernetes.allocation.batch.delay=1 \
   --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.driver.pod.name=$APPNAME \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
   --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
   --conf spark.dynamicAllocation.executorIdleTimeout=30s \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
   --conf spark.dynamicAllocation.minExecutors=0 \
   --conf spark.dynamicAllocation.maxExecutors=20 \
   --conf spark.driver.cores=3 \
   --conf spark.executor.cores=3 \
   --conf spark.driver.memory=1024m \
   --conf spark.executor.memory=1024m \
   $CODE_DIRECTORY_CLOUD/${APPLICATION}

Note that I have kept the memory low (both the driver and executor) to move
the submitted job from the Pending to the Running state. This is by no means
optimal, but I would like to explore ideas on it with the other members.

Thanks

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom

   view my Linkedin profile

 http

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members,

Thank you profoundly for your initial insights. I feel it's necessary to
provide more precision on our setup to facilitate a deeper understanding.

We're interfacing with S3 Compatible storages, but our operational context
is somewhat distinct. Our infrastructure doesn't lean on conventional cloud
providers like AWS. Instead, we've architected our environment on
On-Premise Kubernetes distributions, specifically k0s and Openshift.

Our objective extends beyond just handling S3 keys. We're orchestrating a
solution that integrates Azure SPNs, API Credentials, and other sensitive
credentials, intending to make Kubernetes' native secrets our central
management hub. The aspiration is to have a universally deployable JAR, one
that can function unmodified across different ecosystems like EMR,
Databricks (on both AWS and Azure), etc. Platforms like Databricks have
already made strides in this direction, allowing secrets to be woven
directly into the Spark Conf through mechanisms like
{{secret_scope/secret_name}}, which are resolved dynamically.

The spark-on-k8s-operator's user guide suggests the feasibility of mounting
secrets. However, a gap exists in our understanding of how to subsequently
access these mounted secret values within the Spark application's context.
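
For reference, my current understanding is that with plain spark-submit the
built-in secret support looks roughly like the sketch below (the secret name
aws-secrets, the key names, the mount path and the image are placeholders
for illustration); the secret then shows up either as files under the mount
path or as injected environment variables that the application reads:

spark-submit \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name secrets-demo \
   --conf spark.kubernetes.container.image=my-spark-image:latest \
   --conf spark.kubernetes.driver.secrets.aws-secrets=/etc/secrets \
   --conf spark.kubernetes.executor.secrets.aws-secrets=/etc/secrets \
   --conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secrets:access-key \
   --conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secrets:secret-key \
   --conf spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secrets:access-key \
   --conf spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secrets:secret-key \
   local:///opt/spark/jars/secrets-demo.jar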

Here lies my inquiry: is the spark-on-k8s-operator currently equipped to
support this level of integration? If it does, any elucidation on the
method or best practices would be pivotal for our project. Alternatively,
if you could point me to resources or community experts who have tackled
similar challenges, it would be of immense assistance.

Thank you for bearing with the intricacies of our query, and I appreciate
your continued guidance in this endeavor.

Warm regards,

Jon Rodríguez Aranguren.

On Sat, 30 Sept 2023 at 23:19, Jayabindu Singh ()
wrote:

> Hi Jon,
>
> Using IAM as suggested by Jorn is the best approach.
> We recently moved our spark workload from HDP to Spark on K8 and utilizing
> IAM.
> It will save you from secret management headaches and also allows a lot
> more flexibility on access control and option to allow access to multiple
> S3 buckets in the same pod.
> We have implemented this across Azure, Google and AWS. Azure does require
> some extra work to make it work.
>
> On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke  wrote:
>
>> Don’t use static iam (s3) credentials. It is an outdated insecure method
>> - even AWS recommend against using this for anything (cf eg
>> https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html
>> ).
>> It is almost a guarantee to get your data stolen and your account
>> manipulated.
>>
>> If you need to use kubernetes (which has its own very problematic
>> security issues) then assign AWS IAM roles with minimal permissions to the
>> pods (for EKS it means using OIDC, cf
>> https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).
>>
>> Am 30.09.2023 um 03:41 schrieb Jon Rodríguez Aranguren <
>> jon.r.arangu...@gmail.com>:
>>
>> 
>> Dear Spark Community Members,
>>
>> I trust this message finds you all in good health and spirits.
>>
>> I'm reaching out to the collective expertise of this esteemed community
>> with a query regarding Spark on Kubernetes. As a newcomer, I have always
>> admired the depth and breadth of knowledge shared within this forum, and it
>> is my hope that some of you might have insights on a specific challenge I'm
>> facing.
>>
>> I am currently trying to configure multiple Kubernetes secrets, notably
>> multiple S3 keys, at the SparkConf level for a Spark application. My
>> objective is to understand the best approach or methods to ensure that
>> these secrets can be smoothly accessed by the Spark application.
>>
>> If any of you have previously encountered this scenario or possess
>> relevant insights on the matter, your guidance would be highly beneficial.
>>
>> Thank you for your time and consideration. I'm eager to learn from the
>> experiences and knowledge present within this community.
>>
>> Warm regards,
>> Jon
>>
>>


Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
There is nowadays more of a trend to move away from static
credentials/certificates that are stored in a secret vault. The issue is
that their rotation is complex, once they are leaked they can be abused,
and making minimal permissions feasible is cumbersome, etc. That is why
keyless approaches are used for A2A access (workload identity federation
was mentioned). E.g. in AWS EKS you would build this on OIDC
(https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html)
and configure this instead of using secrets. Similar approaches exist in
other clouds and even on-premise (e.g. SPIFFE, https://spiffe.io/).

Whether this will become the standard is difficult to say - for sure they
seem to be easier to manage. Since you seem to have a Kubernetes setup -
which means, per cloud/data centre, a lot of extra work, infrastructure
cost and security issues - workload identity federation may ease this
compared to a secret store.

On 01.10.2023 at 08:27, Jon Rodríguez Aranguren wrote:

> Dear Jörn Franke, Jayabindu Singh and Spark Community members,
>
> Thank you profoundly for your initial insights. I feel it's necessary to
> provide more precision on our setup to facilitate a deeper understanding.
>
> We're interfacing with S3 Compatible storages, but our operational context
> is somewhat distinct. Our infrastructure doesn't lean on conventional
> cloud providers like AWS. Instead, we've architected our environment on
> On-Premise Kubernetes distributions, specifically k0s and Openshift.
>
> Our objective extends beyond just handling S3 keys. We're orchestrating a
> solution that integrates Azure SPNs, API Credentials, and other sensitive
> credentials, intending to make Kubernetes' native secrets our central
> management hub. The aspiration is to have a universally deployable JAR,
> one that can function unmodified across different ecosystems like EMR,
> Databricks (on both AWS and Azure), etc. Platforms like Databricks have
> already made strides in this direction, allowing secrets to be woven
> directly into the Spark Conf through mechanisms like
> {{secret_scope/secret_name}}, which are resolved dynamically.
>
> The spark-on-k8s-operator's user guide suggests the feasibility of
> mounting secrets. However, a gap exists in our understanding of how to
> subsequently access these mounted secret values within the Spark
> application's context.
>
> Here lies my inquiry: is the spark-on-k8s-operator currently equipped to
> support this level of integration? If it does, any elucidation on the
> method or best practices would be pivotal for our project. Alternatively,
> if you could point me to resources or community experts who have tackled
> similar challenges, it would be of immense assistance.
>
> Thank you for bearing with the intricacies of our query, and I appreciate
> your continued guidance in this endeavor.
>
> Warm regards,
> Jon Rodríguez Aranguren.
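
To make the EKS/OIDC route above concrete, here is a minimal sketch combining
IAM Roles for Service Accounts (IRSA) with Spark on Kubernetes; the cluster
name, namespace, service account, policy ARN and application are hypothetical
placeholders rather than values from this thread:

```
# Bind a Kubernetes service account to an IAM role through the cluster's
# OIDC provider; eksctl creates the role, the trust policy and the
# eks.amazonaws.com/role-arn annotation on the service account.
eksctl create iamserviceaccount \
  --cluster my-eks-cluster \
  --namespace spark \
  --name spark-sa \
  --attach-policy-arn arn:aws:iam::123456789012:policy/spark-s3-access \
  --approve

# Run driver and executors under that service account so the pods obtain
# short-lived credentials from the projected web identity token instead of
# static keys stored in secrets.
spark-submit \
  --master k8s://https://<k8s-api-server>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  local:///opt/spark/examples/src/main/python/pi.py
```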




Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
With OIDC something comparable is possible:
https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html

On 01.10.2023 at 11:13, Mich Talebzadeh wrote:

> It seems that workload identity is not available on AWS. Workload
> Identity replaces the need to use Metadata concealment on exposed storage
> such as s3 and gcs. The sensitive metadata protected by metadata
> concealment is also protected by Workload Identity.
>
> Both Google Cloud Kubernetes (GKE) and Azure Kubernetes Service support
> Workload Identity. Taking notes from Google Cloud: "Workload Identity is
> the recommended way for your workloads running on Google Kubernetes
> Engine (GKE) to access Google Cloud services in a secure and manageable
> way."
>
> HTH




Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
It seems that workload identity
<https://cloud.google.com/iam/docs/workload-identity-federation> is not
available on AWS. Workload Identity replaces the need to use Metadata
concealment on exposed storage such as s3 and gcs. The sensitive metadata
protected by metadata concealment is also protected by Workload Identity.

Both Google Cloud Kubernetes (GKE
<https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>)
and Azure Kubernetes Service
<https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview>
support Workload Identity. Taking notes from Google Cloud:  "Workload
Identity is the recommended way for your workloads running on Google
Kubernetes Engine (GKE) to access Google Cloud services in a secure and
manageable way."


HTH


Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 1 Oct 2023 at 06:36, Jayabindu Singh  wrote:

> Hi Jon,
>
> Using IAM as suggested by Jorn is the best approach.
> We recently moved our spark workload from HDP to Spark on K8 and utilizing
> IAM.
> It will save you from secret management headaches and also allows a lot
> more flexibility on access control and option to allow access to multiple
> S3 buckets in the same pod.
> We have implemented this across Azure, Google and AWS. Azure does require
> some extra work to make it work.
>
> On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke  wrote:
>
>> Don’t use static iam (s3) credentials. It is an outdated insecure method
>> - even AWS recommend against using this for anything (cf eg
>> https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html
>> ).
>> It is almost a guarantee to get your data stolen and your account
>> manipulated.
>>
>> If you need to use kubernetes (which has its own very problematic
>> security issues) then assign AWS IAM roles with minimal permissions to the
>> pods (for EKS it means using OIDC, cf
>> https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).
>>
>> On 30.09.2023 at 03:41, Jon Rodríguez Aranguren <
>> jon.r.arangu...@gmail.com> wrote:
>>
>> 
>> Dear Spark Community Members,
>>
>> I trust this message finds you all in good health and spirits.
>>
>> I'm reaching out to the collective expertise of this esteemed community
>> with a query regarding Spark on Kubernetes. As a newcomer, I have always
>> admired the depth and breadth of knowledge shared within this forum, and it
>> is my hope that some of you might have insights on a specific challenge I'm
>> facing.
>>
>> I am currently trying to configure multiple Kubernetes secrets, notably
>> multiple S3 keys, at the SparkConf level for a Spark application. My
>> objective is to understand the best approach or methods to ensure that
>> these secrets can be smoothly accessed by the Spark application.
>>
>> If any of you have previously encountered this scenario or possess
>> relevant insights on the matter, your guidance would be highly beneficial.
>>
>> Thank you for your time and consideration. I'm eager to learn from the
>> experiences and knowledge present within this community.
>>
>> Warm regards,
>> Jon
>>
>>


Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon,

Using IAM as suggested by Jorn is the best approach.
We recently moved our spark workload from HDP to Spark on K8 and utilizing
IAM.
It will save you from secret management headaches and also allows a lot
more flexibility on access control and option to allow access to multiple
S3 buckets in the same pod.
We have implemented this across Azure, Google and AWS. Azure does require
some extra work to make it work.

On Sat, Sep 30, 2023 at 12:05 PM Jörn Franke  wrote:

> Don’t use static iam (s3) credentials. It is an outdated insecure method -
> even AWS recommend against using this for anything (cf eg
> https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html
> ).
> It is almost a guarantee to get your data stolen and your account
> manipulated.
>
> If you need to use kubernetes (which has its own very problematic security
> issues) then assign AWS IAM roles with minimal permissions to the pods (for
> EKS it means using OIDC, cf
> https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).
>
> On 30.09.2023 at 03:41, Jon Rodríguez Aranguren <
> jon.r.arangu...@gmail.com> wrote:
>
> 
> Dear Spark Community Members,
>
> I trust this message finds you all in good health and spirits.
>
> I'm reaching out to the collective expertise of this esteemed community
> with a query regarding Spark on Kubernetes. As a newcomer, I have always
> admired the depth and breadth of knowledge shared within this forum, and it
> is my hope that some of you might have insights on a specific challenge I'm
> facing.
>
> I am currently trying to configure multiple Kubernetes secrets, notably
> multiple S3 keys, at the SparkConf level for a Spark application. My
> objective is to understand the best approach or methods to ensure that
> these secrets can be smoothly accessed by the Spark application.
>
> If any of you have previously encountered this scenario or possess
> relevant insights on the matter, your guidance would be highly beneficial.
>
> Thank you for your time and consideration. I'm eager to learn from the
> experiences and knowledge present within this community.
>
> Warm regards,
> Jon
>
>


Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static iam (s3) credentials. It is an outdated insecure method - even 
AWS recommend against using this for anything (cf eg 
https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html).
It is almost a guarantee to get your data stolen and your account manipulated. 

If you need to use kubernetes (which has its own very problematic security 
issues) then assign AWS IAM roles with minimal permissions to the pods (for EKS 
it means using OIDC, cf 
https://docs.aws.amazon.com/eks/latest/userguide/service_IAM_role.html).

> On 30.09.2023 at 03:41, Jon Rodríguez Aranguren
> wrote:
> 
> 
> Dear Spark Community Members,
> 
> I trust this message finds you all in good health and spirits.
> 
> I'm reaching out to the collective expertise of this esteemed community with 
> a query regarding Spark on Kubernetes. As a newcomer, I have always admired 
> the depth and breadth of knowledge shared within this forum, and it is my 
> hope that some of you might have insights on a specific challenge I'm facing.
> 
> I am currently trying to configure multiple Kubernetes secrets, notably 
> multiple S3 keys, at the SparkConf level for a Spark application. My 
> objective is to understand the best approach or methods to ensure that these 
> secrets can be smoothly accessed by the Spark application.
> 
> If any of you have previously encountered this scenario or possess relevant 
> insights on the matter, your guidance would be highly beneficial.
> 
> Thank you for your time and consideration. I'm eager to learn from the 
> experiences and knowledge present within this community.
> 
> Warm regards,
> Jon


Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
Dear Spark Community Members,

I trust this message finds you all in good health and spirits.

I'm reaching out to the collective expertise of this esteemed community
with a query regarding Spark on Kubernetes. As a newcomer, I have always
admired the depth and breadth of knowledge shared within this forum, and it
is my hope that some of you might have insights on a specific challenge I'm
facing.

I am currently trying to configure multiple Kubernetes secrets, notably
multiple S3 keys, at the SparkConf level for a Spark application. My
objective is to understand the best approach or methods to ensure that
these secrets can be smoothly accessed by the Spark application.

If any of you have previously encountered this scenario or possess relevant
insights on the matter, your guidance would be highly beneficial.

Thank you for your time and consideration. I'm eager to learn from the
experiences and knowledge present within this community.

Warm regards,
Jon
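
For reference, Spark's own Kubernetes backend can mount secrets as files or
inject individual keys as environment variables directly from the submission
configuration. A minimal sketch follows; the secret names, keys and mount
paths are hypothetical, and the replies in this thread recommend preferring
workload identity over long-lived keys where possible:

```
# Create the secrets in the application's namespace (example names).
kubectl -n spark create secret generic s3-keys-a \
  --from-literal=AWS_ACCESS_KEY_ID=AKIA... \
  --from-literal=AWS_SECRET_ACCESS_KEY=...
kubectl -n spark create secret generic s3-keys-b \
  --from-file=credentials=./creds-b.properties

# spark.kubernetes.{driver,executor}.secrets.<secret>=<path> mounts a whole
# secret as files under <path>;
# spark.kubernetes.{driver,executor}.secretKeyRef.<ENV>=<secret>:<key>
# injects a single key as an environment variable.
spark-submit \
  --conf spark.kubernetes.driver.secrets.s3-keys-b=/mnt/secrets/s3-b \
  --conf spark.kubernetes.executor.secrets.s3-keys-b=/mnt/secrets/s3-b \
  --conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=s3-keys-a:AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=s3-keys-a:AWS_SECRET_ACCESS_KEY \
  --conf spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=s3-keys-a:AWS_ACCESS_KEY_ID \
  --conf spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=s3-keys-a:AWS_SECRET_ACCESS_KEY \
  ...
```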


SPIP: Adding work load identity to Spark on Kubernetes documents (supersedes Secret Management)

2023-02-20 Thread Mich Talebzadeh
Hi,

I would like to propose that the current Secret Management
<https://spark.apache.org/docs/latest/running-on-kubernetes.html#secret-management>
in
Spark Kubernetes documentation to include the more secure credentials
Workload identity) for Spark to access Kubernetes services.


Both Google Cloud Kubernetes (GKE
<https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity>)
and Azure Kubernetes Service
<https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview>
support Workload Identity.


Taking notes from Google Cloud "Workload Identity is the recommended way
for your workloads running on Google Kubernetes Engine (GKE) to access
Google Cloud services in a secure and manageable way."


Workload Identity replaces the need to use Metadata concealment. The
sensitive metadata protected by metadata concealment is also protected by
Workload Identity.


In the usual way, we had secret management that required the secret to be
put on a shared drive that the nodes of the K8s cluster could access. This
was normally on Cloud Storage and exposed the following:


kubectl create secret generic spark-sa  --namespace=spark
--from-file=./spark-sa.json


that spark-sa.json file contained the following:


{
  "type": "service_account",
  "project_id": "",
  "private_key_id": "7a0d67d19c5d74337792c2320d698085e9",
  "private_key": "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwggSjAgEAAoIBAQDTsqxsyqDP4ViY\nhO0y7INv+tr8pKEz630DOkjI/kzKfvYelzlrjZ+/EAkqOymCzIIF1LsRG8y//G3/\nzUGR2tcUKbeEaeaJJtG3tGJfCnEoApL3+jA7OvNEbJoeFsMgZ82cDXeZtYdmPdX0\nd1gwpb1yrzBckecsuG0yHs0biz9pwR7xvIPjEo26AcrFvQeOLY2P60UM40AED0F+\n23QtlsXBTjMaWih020fWNlVJSaA+FkVGfSMgQ233/5qeVeLOIBJ9BDgxf4M9OYZO\n ... \n-----END PRIVATE KEY-----\n",
  "client_email": "spark-bq@.iam.gserviceaccount.com",
  "client_id": "10032476552331",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/spark-bq%.iam.gserviceaccount.com"
}


Cloud service account keys do not expire and require manual rotation.
Exporting service account keys has the potential to expand the scope of a
security breach if it goes undetected. If an exported key is stolen, an
attacker can use it to authenticate as that service account until this is
noticed and the key is manually revoked. The mounted key file also contains
a lot of sensitive material that can be read from the mount directory.


Let me know your thoughts


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
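
As an illustration of the Workload Identity alternative to the mounted key
file above, a minimal GKE sketch; the project, namespace and account names
are hypothetical, and Workload Identity must already be enabled on the
cluster and node pool:

```
# Google service account (GSA) that carries the actual IAM permissions.
gcloud iam service-accounts create spark-gsa --project=my-project

# Let the Kubernetes service account (KSA) spark/spark-sa impersonate the GSA.
gcloud iam service-accounts add-iam-policy-binding \
  spark-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[spark/spark-sa]"

# Annotate the KSA so GKE injects federated credentials instead of a key file.
kubectl annotate serviceaccount spark-sa --namespace spark \
  iam.gke.io/gcp-service-account=spark-gsa@my-project.iam.gserviceaccount.com

# Driver and executors then run under the KSA, with no spark-sa.json to mount:
#   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa
```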


Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-16 Thread Mich Talebzadeh
You can try this

gsutil cp src/StructuredStream-on-gke.py gs://codes/

where you create a bucket on gcs called codes


Then in your spark-submit do:


spark-submit --verbose \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name  \
   --conf spark.kubernetes.driver.container.image=pyspark-example:0.1 \
   --conf spark.kubernetes.executor.container.image=pyspark-example:0.1 \
   gs://codes/StructuredStream-on-gke.py



HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Feb 2023 at 21:17, karan alang  wrote:

> thnks, Mich .. let me check this
>
>
>
> On Wed, Feb 15, 2023 at 1:42 AM Mich Talebzadeh 
> wrote:
>
>>
>> It may help to check this article of mine
>>
>>
>> Spark on Kubernetes, A Practitioner’s Guide
>> <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/?trackingId=FDQORri0TBeJl02p3D%2B2JA%3D%3D>
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Feb 2023 at 09:12, Mich Talebzadeh 
>> wrote:
>>
>>> Your submit command
>>>
>>> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
>>> cluster --name pyspark-example --conf 
>>> spark.kubernetes.container.image=pyspark-example:0.1
>>> --conf spark.kubernetes.file.upload.path=/myexample
>>> src/StructuredStream-on-gke.py
>>>
>>>
>>> pay attention to what it says
>>>
>>>
>>> --conf spark.kubernetes.file.upload.path
>>>
>>> That refers to your Python package on GCS storage not in the docker
>>> itself
>>>
>>>
>>> From
>>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>>>
>>>
>>> "... The app jar file will be uploaded to the S3 and then when the
>>> driver is launched it will be downloaded to the driver pod and will be
>>> added to its classpath. Spark will generate a subdir under the upload path
>>> with a random name to avoid conflicts with spark apps running in parallel.
>>> User could manage the subdirs created according to his needs..."
>>>
>>>
>>> In your case it is gs not s3
>>>
>>>
>>> There is no point putting your python file in the docker image itself!
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:
>>>
>>>> Hi Ye,
>>>>
>>>> This is the error i get when i don't set the
>>>> spark.kubernetes.file.upload.path
>>>>
>>>> Any ideas on how to fix this ?
>>>>
>>>> ```
>>>>
>>>> Exception in thread "main" org.apache.spark.SparkException: Please
>>>> specify spark.kubernetes.file.upload.path property.
>>>>
>>>> at
>>>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(Kuber

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread karan alang
thnks, Mich .. let me check this



On Wed, Feb 15, 2023 at 1:42 AM Mich Talebzadeh 
wrote:

>
> It may help to check this article of mine
>
>
> Spark on Kubernetes, A Practitioner’s Guide
> <https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/?trackingId=FDQORri0TBeJl02p3D%2B2JA%3D%3D>
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Feb 2023 at 09:12, Mich Talebzadeh 
> wrote:
>
>> Your submit command
>>
>> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
>> cluster --name pyspark-example --conf 
>> spark.kubernetes.container.image=pyspark-example:0.1
>> --conf spark.kubernetes.file.upload.path=/myexample
>> src/StructuredStream-on-gke.py
>>
>>
>> pay attention to what it says
>>
>>
>> --conf spark.kubernetes.file.upload.path
>>
>> That refers to your Python package on GCS storage not in the docker itself
>>
>>
>> From
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>>
>>
>> "... The app jar file will be uploaded to the S3 and then when the
>> driver is launched it will be downloaded to the driver pod and will be
>> added to its classpath. Spark will generate a subdir under the upload path
>> with a random name to avoid conflicts with spark apps running in parallel.
>> User could manage the subdirs created according to his needs..."
>>
>>
>> In your case it is gs not s3
>>
>>
>> There is no point putting your python file in the docker image itself!
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:
>>
>>> Hi Ye,
>>>
>>> This is the error i get when i don't set the
>>> spark.kubernetes.file.upload.path
>>>
>>> Any ideas on how to fix this ?
>>>
>>> ```
>>>
>>> Exception in thread "main" org.apache.spark.SparkException: Please
>>> specify spark.kubernetes.file.upload.path property.
>>>
>>> at
>>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
>>>
>>> at
>>> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>>>
>>> at
>>> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>>>
>>> at
>>> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>>>
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>>>
>>> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>>>
>>> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>>>
>>> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
>>>
>>> at scala.collection.immutable.List.foreach(List.scala:392)
>>>
>>> at
>>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
>>&

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread Mich Talebzadeh
It may help to check this article of mine


Spark on Kubernetes, A Practitioner’s Guide
<https://www.linkedin.com/pulse/spark-kubernetes-practitioners-guide-mich-talebzadeh-ph-d-/?trackingId=FDQORri0TBeJl02p3D%2B2JA%3D%3D>


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Feb 2023 at 09:12, Mich Talebzadeh 
wrote:

> Your submit command
>
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
> cluster --name pyspark-example --conf 
> spark.kubernetes.container.image=pyspark-example:0.1
> --conf spark.kubernetes.file.upload.path=/myexample
> src/StructuredStream-on-gke.py
>
>
> pay attention to what it says
>
>
> --conf spark.kubernetes.file.upload.path
>
> That refers to your Python package on GCS storage not in the docker itself
>
>
> From
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>
>
> "... The app jar file will be uploaded to the S3 and then when the driver
> is launched it will be downloaded to the driver pod and will be added to
> its classpath. Spark will generate a subdir under the upload path with a
> random name to avoid conflicts with spark apps running in parallel. User
> could manage the subdirs created according to his needs..."
>
>
> In your case it is gs not s3
>
>
> There is no point putting your python file in the docker image itself!
>
>
> HTH
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:
>
>> Hi Ye,
>>
>> This is the error i get when i don't set the
>> spark.kubernetes.file.upload.path
>>
>> Any ideas on how to fix this ?
>>
>> ```
>>
>> Exception in thread "main" org.apache.spark.SparkException: Please
>> specify spark.kubernetes.file.upload.path property.
>>
>> at
>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
>>
>> at
>> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
>>
>> at
>> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>>
>> at
>> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>>
>> at
>> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>>
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>>
>> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>>
>> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>>
>> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>>
>> at
>> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
>>
>> at
>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
>>
>> at scala.collection.immutable.List.foreach(List.scala:392)
>>
>> at
>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
>>
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
>>
>> at
>> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>>
>> at
>> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>>
>> at scala.collection.immutable.List.foldLeft(List.scala:89)
>>
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
>>
>> at
>> org.apache.spark.deploy.k8s.submit.Client.r

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread Mich Talebzadeh
Your submit command

spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster
--name pyspark-example --conf
spark.kubernetes.container.image=pyspark-example:0.1
--conf spark.kubernetes.file.upload.path=/myexample
src/StructuredStream-on-gke.py


pay attention to what it says


--conf spark.kubernetes.file.upload.path

That refers to your Python package on GCS storage not in the docker itself


From
https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management


"... The app jar file will be uploaded to the S3 and then when the driver
is launched it will be downloaded to the driver pod and will be added to
its classpath. Spark will generate a subdir under the upload path with a
random name to avoid conflicts with spark apps running in parallel. User
could manage the subdirs created according to his needs..."


In your case it is gs not s3


There is no point putting your python file in the docker image itself!


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 15 Feb 2023 at 07:46, karan alang  wrote:

> Hi Ye,
>
> This is the error i get when i don't set the
> spark.kubernetes.file.upload.path
>
> Any ideas on how to fix this ?
>
> ```
>
> Exception in thread "main" org.apache.spark.SparkException: Please specify
> spark.kubernetes.file.upload.path property.
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)
>
> at
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>
> at
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>
> at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>
> at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)
>
> at
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)
>
> at scala.collection.immutable.List.foreach(List.scala:392)
>
> at
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>
> at scala.collection.immutable.List.foldLeft(List.scala:89)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
>
> at
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
>
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
>
> at org.apache.spark.deploy.SparkSubmit.org
> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
>
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> ```
>
> On Tue, Feb 14, 2023 at 1:33 AM Ye Xianjin  wrote:
>
>> The configuration of ‘…file.upload.path’ is wrong. it means a distributed
>> fs path to store your archives/resource/jars temporarily, then distributed
>> by spark to drivers/executors.
>> For your cases, you don’t need to set this configuration.
>> Sent from my 

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread karan alang
Hi Ye,

This is the error i get when i don't set the
spark.kubernetes.file.upload.path

Any ideas on how to fix this ?

```

Exception in thread "main" org.apache.spark.SparkException: Please specify
spark.kubernetes.file.upload.path property.

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:299)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:248)

at
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)

at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)

at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)

at scala.collection.TraversableLike.map(TraversableLike.scala:238)

at scala.collection.TraversableLike.map$(TraversableLike.scala:231)

at scala.collection.AbstractTraversable.map(Traversable.scala:108)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:247)

at
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:173)

at scala.collection.immutable.List.foreach(List.scala:392)

at
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:164)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:60)

at
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)

at
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)

at scala.collection.immutable.List.foldLeft(List.scala:89)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)

at
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)

at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)

at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)

at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)

at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)

at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

On Tue, Feb 14, 2023 at 1:33 AM Ye Xianjin  wrote:

> The configuration of ‘…file.upload.path’ is wrong. it means a distributed
> fs path to store your archives/resource/jars temporarily, then distributed
> by spark to drivers/executors.
> For your cases, you don’t need to set this configuration.
> Sent from my iPhone
>
> On Feb 14, 2023, at 5:43 AM, karan alang  wrote:
>
> 
> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing:
> Note : I have spark(bitnami spark chart) installed on GKE using helm
> install
>
> Here is what is done :
> 1. created a docker image using Dockerfile
>
> Dockerfile :
> ```
>
> FROM python:3.7-slim
>
> RUN apt-get update && \
> apt-get install -y default-jre && \
> apt-get install -y openjdk-11-jre-headless && \
> apt-get clean
>
> ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
>
> RUN pip install pyspark
> RUN mkdir -p /myexample && chmod 755 /myexample
> WORKDIR /myexample
>
> COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
>
> CMD ["pyspark"]
>
> ```
> Simple pyspark application :
> ```
>
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
>
> data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
> df = spark.createDataFrame(data, ('id', 'salary'))
>
> df.show(5, False)
>
> ```
>
> Spark-submit command :
> ```
>
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
> cluster --name pyspark-example --conf
> spark.kubernetes.container.image=pyspark-example:0.1 --conf
> spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
> ```
>
> Error i get :
> ```
>
> 23/02/13 13:18:27 INFO KubernetesUtils: Uploading file:
> /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
> to dest:
> /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
>
> Exception in 

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Ye Xianjin
The configuration of '…file.upload.path' is wrong. It means a distributed fs
path to store your archives/resources/jars temporarily, which are then
distributed by spark to drivers/executors. For your case, you don't need to
set this configuration.

Sent from my iPhone

On Feb 14, 2023, at 5:43 AM, karan alang  wrote:

Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is
failing:
Note : I have spark(bitnami spark chart) installed on GKE using helm install

Here is what is done :
1. created a docker image using Dockerfile

Dockerfile :
```
FROM python:3.7-slim

RUN apt-get update && \
apt-get install -y default-jre && \
apt-get install -y openjdk-11-jre-headless && \
apt-get clean

ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample

COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py

CMD ["pyspark"]
```

Simple pyspark application :
```
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()

data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))

df.show(5, False)
```

Spark-submit command :
```
spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster
--name pyspark-example --conf
spark.kubernetes.container.image=pyspark-example:0.1 --conf
spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error i get :
```





23/02/13 13:18:27 INFO KubernetesUtils: Uploading file: /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py to dest: /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
Exception in thread "main" org.apache.spark.SparkException: Uploading file /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py failed...
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
	at org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
	at org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
	at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
	at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Error uploading file StructuredStream-on-gke.py
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
	at org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
	... 21 more
Caused by: java.io.IOException: Mkdirs failed to create /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
	at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)
	at org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)
	at 

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread Khalid Mammadov
I am not a k8s expert, but I think you have a permission issue. Try 777 as
an example to see if it works.

On Mon, 13 Feb 2023, 21:42 karan alang,  wrote:

> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing:
> Note : I have spark(bitnami spark chart) installed on GKE using helm
> install
>
> Here is what is done :
> 1. created a docker image using Dockerfile
>
> Dockerfile :
> ```
>
> FROM python:3.7-slim
>
> RUN apt-get update && \
> apt-get install -y default-jre && \
> apt-get install -y openjdk-11-jre-headless && \
> apt-get clean
>
> ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64
>
> RUN pip install pyspark
> RUN mkdir -p /myexample && chmod 755 /myexample
> WORKDIR /myexample
>
> COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py
>
> CMD ["pyspark"]
>
> ```
> Simple pyspark application :
> ```
>
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()
>
> data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
> df = spark.createDataFrame(data, ('id', 'salary'))
>
> df.show(5, False)
>
> ```
>
> Spark-submit command :
> ```
>
> spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode
> cluster --name pyspark-example --conf
> spark.kubernetes.container.image=pyspark-example:0.1 --conf
> spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
> ```
>
> Error i get :
> ```
>
> 23/02/13 13:18:27 INFO KubernetesUtils: Uploading file:
> /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
> to dest:
> /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...
>
> Exception in thread "main" org.apache.spark.SparkException: Uploading file
> /Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
> failed...
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)
>
> at
> org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)
>
> at
> org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>
> at
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>
> at scala.collection.immutable.List.foldLeft(List.scala:89)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)
>
> at
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)
>
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)
>
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)
>
> at org.apache.spark.deploy.SparkSubmit.org
> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
>
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Caused by: org.apache.spark.SparkException: Error uploading file
> StructuredStream-on-gke.py
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)
>
> at
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)
>
> ... 21 more
>
> Caused by: java.io.IOException: Mkdirs failed to create
> /myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)
>
> at
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
>
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
>
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
>
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)
>
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
>
> at 

Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-13 Thread karan alang
Hello All,

I'm trying to run a simple application on GKE (Kubernetes), and it is
failing:
Note : I have spark(bitnami spark chart) installed on GKE using helm
install

Here is what is done :
1. created a docker image using Dockerfile

Dockerfile :
```

FROM python:3.7-slim

RUN apt-get update && \
apt-get install -y default-jre && \
apt-get install -y openjdk-11-jre-headless && \
apt-get clean

ENV JAVA_HOME /usr/lib/jvm/java-11-openjdk-amd64

RUN pip install pyspark
RUN mkdir -p /myexample && chmod 755 /myexample
WORKDIR /myexample

COPY src/StructuredStream-on-gke.py /myexample/StructuredStream-on-gke.py

CMD ["pyspark"]

```
Simple pyspark application :
```

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming-on-gke").getOrCreate()

data = [('k1', 123000), ('k2', 234000), ('k3', 456000)]
df = spark.createDataFrame(data, ('id', 'salary'))

df.show(5, False)

```

Spark-submit command :
```

spark-submit --master k8s://https://34.74.22.140:7077 --deploy-mode cluster
--name pyspark-example --conf
spark.kubernetes.container.image=pyspark-example:0.1 --conf
spark.kubernetes.file.upload.path=/myexample src/StructuredStream-on-gke.py
```

Error i get :
```

23/02/13 13:18:27 INFO KubernetesUtils: Uploading file:
/Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
to dest:
/myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a/StructuredStream-on-gke.py...

Exception in thread "main" org.apache.spark.SparkException: Uploading file
/Users/karanalang/PycharmProjects/Kafka/pyspark-docker/src/StructuredStream-on-gke.py
failed...

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:296)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.renameMainAppResource(KubernetesUtils.scala:270)

at
org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configureForPython(DriverCommandFeatureStep.scala:109)

at
org.apache.spark.deploy.k8s.features.DriverCommandFeatureStep.configurePod(DriverCommandFeatureStep.scala:44)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$3(KubernetesDriverBuilder.scala:59)

at
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)

at
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)

at scala.collection.immutable.List.foldLeft(List.scala:89)

at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:58)

at
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3(KubernetesClientApplication.scala:213)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$3$adapted(KubernetesClientApplication.scala:207)

at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2622)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:207)

at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:179)

at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)

at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)

at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)

at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Caused by: org.apache.spark.SparkException: Error uploading file
StructuredStream-on-gke.py

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:319)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:292)

... 21 more

Caused by: java.io.IOException: Mkdirs failed to create
/myexample/spark-upload-12228079-d652-4bf3-b907-3810d275124a

at
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:317)

at
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)

at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:414)

at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)

at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2369)

at
org.apache.hadoop.fs.FilterFileSystem.copyFromLocalFile(FilterFileSystem.java:368)

at
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:316)

... 22 more
```

Any ideas on how to fix this & get it to work ?
tia !

Pls see the stackoverflow link :

https://stackoverflow.com/questions/75441360/running-spark-application-on-gke-failing-on-spark-submit
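
Putting the advice from the replies together, a corrected submit for this
example could look roughly as follows; the staging bucket is a hypothetical
name, and the GCS connector must be available to the spark-submit client for
the gs:// upload path to work:

```
# Stage the application on a distributed filesystem the driver pod can read,
# instead of a local directory baked into the image.
gsutil mb gs://my-spark-staging        # one-off: create a staging bucket

spark-submit \
  --master k8s://https://<kubernetes-api-server>:443 \
  --deploy-mode cluster \
  --name pyspark-example \
  --conf spark.kubernetes.container.image=pyspark-example:0.1 \
  --conf spark.kubernetes.file.upload.path=gs://my-spark-staging \
  src/StructuredStream-on-gke.py
```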


Re: spark on kubernetes

2022-10-16 Thread Qian Sun
Glad to hear it!

On Sun, Oct 16, 2022 at 2:37 PM Mohammad Abdollahzade Arani <
mamadazar...@gmail.com> wrote:

> Hi Qian,
> Thanks for the reply and I'm So sorry for the late reply.
> I found the answer. My mistake was token conversion. I had to decode
> base64  the service accounts token and certificate.
> and you are right I have to use `service account cert` to configure
> spark.kubernetes.authenticate.caCertFile.
> Thanks again. best regards.
>
> On Sat, Oct 15, 2022 at 4:51 PM Qian Sun  wrote:
>
>> Hi Mohammad
>> Did you try this command?
>>
>>
>> ./bin/spark-submit \
>>   --master k8s://https://vm13:6443 \
>>   --class com.example.WordCounter \
>>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
>>   --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
>>   --conf spark.kubernetes.namespace=default \
>>   java-word-count-1.0-SNAPSHOT.jar
>>
>> If you want spark.kubernetes.authenticate.caCertFile, you need to
>> configure it to serviceaccount certFile instead of apiserver certFile.
>>
>> On Sat, Oct 15, 2022 at 8:30 PM Mohammad Abdollahzade Arani
>> mamadazar...@gmail.com  wrote:
>>
>> I have a k8s cluster and a spark cluster.
>>>  my question is is as bellow:
>>>
>>>
>>> https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc
>>>
>>> I have searched and I found lot's of other similar questions on
>>> stackoverflow without an answer like  bellow:
>>>
>>>
>>> https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource
>>>
>>>
>>> --
>>> Best Wishes!
>>> Mohammad Abdollahzade Arani
>>> Computer Engineering @ SBU
>>>
>>> --
>> Best!
>> Qian Sun
>>
>
>
> --
> Best Wishes!
> Mohammad Abdollahzade Arani
> Computer Engineering @ SBU
>
>

-- 
Best!
Qian Sun


Re: spark on kubernetes

2022-10-15 Thread Qian Sun
Hi Mohammad
Did you try this command?


./bin/spark-submit \
  --master k8s://https://vm13:6443 \
  --class com.example.WordCounter \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
  --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
  --conf spark.kubernetes.namespace=default \
  java-word-count-1.0-SNAPSHOT.jar

If you want to use spark.kubernetes.authenticate.caCertFile, you need to point
it at the service account's CA cert file rather than the API server's cert file.
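
For anyone hitting the same authentication problem, here is only a rough sketch of the
fix described above and in the later reply. The namespace, service account, and output
paths are placeholders, and it assumes a cluster that still auto-creates a token Secret
for the service account (the data fields in a Secret are base64-encoded, so they must
be decoded before Spark can use them as files):

# Find the service account's token Secret and decode its CA cert and token.
SECRET=$(kubectl -n default get serviceaccount default -o jsonpath='{.secrets[0].name}')
kubectl -n default get secret "$SECRET" -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/sa-ca.crt
kubectl -n default get secret "$SECRET" -o jsonpath='{.data.token}' | base64 -d > /tmp/sa-token

# Then point the client-mode authentication properties at the decoded files.
./bin/spark-submit \
  --master k8s://https://vm13:6443 \
  --class com.example.WordCounter \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
  --conf spark.kubernetes.authenticate.caCertFile=/tmp/sa-ca.crt \
  --conf spark.kubernetes.authenticate.oauthTokenFile=/tmp/sa-token \
  --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
  java-word-count-1.0-SNAPSHOT.jar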

On Sat, Oct 15, 2022 at 8:30 PM Mohammad Abdollahzade Arani
mamadazar...@gmail.com  wrote:

I have a k8s cluster and a spark cluster.
>  my question is is as bellow:
>
>
> https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc
>
> I have searched and I found lot's of other similar questions on
> stackoverflow without an answer like  bellow:
>
>
> https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource
>
>
> --
> Best Wishes!
> Mohammad Abdollahzade Arani
> Computer Engineering @ SBU
>
> --
Best!
Qian Sun


spark on kubernetes

2022-10-15 Thread Mohammad Abdollahzade Arani
I have a k8s cluster and a Spark cluster.
My question is as below:

https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc

I have searched and found lots of other similar questions on
Stack Overflow without an answer, like the one below:

https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource


-- 
Best Wishes!
Mohammad Abdollahzade Arani
Computer Engineering @ SBU


trouble using spark in kubernetes

2022-05-03 Thread Andreas Klos

Hello everyone,

I am trying to run a minimal example in my k8s cluster.

First, I cloned the petastorm github repo: https://github.com/uber/petastorm

Second, I created a Docker image as follows:

FROM ubuntu:20.04
RUN apt-get update -qq
RUN apt-get install -qq -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get update -qq

RUN apt-get -qq install -y \
  build-essential \
  cmake \
  openjdk-8-jre-headless \
  git \
  python \
  python3-pip \
  python3.9 \
  python3.9-dev \
  python3.9-venv \
  virtualenv \
  wget \
  && rm -rf /var/lib/apt/lists/*
RUN wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2 -P /data/mnist/
RUN mkdir /petastorm
ADD setup.py /petastorm/
ADD README.rst /petastorm/
ADD petastorm /petastorm/petastorm
RUN python3.9 -m pip install pip --upgrade
RUN python3.9 -m pip install wheel
RUN python3.9 -m venv /petastorm_venv3.9
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache scikit-build
RUN /petastorm_venv3.9/bin/pip3.9 install --no-cache -e /petastorm/[test,tf,torch,docs,opencv] --only-binary pyarrow --only-binary opencv-python
RUN /petastorm_venv3.9/bin/pip3.9 install -U pyarrow==3.0.0 numpy==1.19.3 tensorflow==2.5.0 pyspark==3.0.0
RUN /petastorm_venv3.9/bin/pip3.9 install opencv-python-headless
RUN rm -r /petastorm
ADD docker/run_in_venv.sh /

Afterwards, I create a namespace called spark in my k8s cluster, a
ServiceAccount (spark-driver), and a RoleBinding for the service account,
as follows:


kubectl create ns spark
kubectl create serviceaccount spark-driver
kubectl create rolebinding spark-driver-rb --clusterrole=cluster-admin --serviceaccount=spark:spark-driver


Finally I create a pod in the spark namespace as follows:

apiVersion: v1
kind: Pod
metadata:
  name: "petastorm-ds-creator"
  namespace: spark
  labels:
    app: "petastorm-ds-creator"
spec:
  serviceAccount: spark-driver
  containers:
  - name: petastorm-ds-creator
    image: "imagename"
    command:
      - "/bin/bash"
      - "-c"
      - "--"
    args:
      - "while true; do sleep 30; done;"
    resources:
      limits:
        cpu: 2000m
        memory: 5000Mi
      requests:
        cpu: 2000m
        memory: 5000Mi
    ports:
    - containerPort: 80
      name: http
    - containerPort: 443
      name: https
    - containerPort: 20022
      name: exposed
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: spark-geodata-nfs-pvc-20220503
  restartPolicy: Always

I expose port 20022 of the pod with a headless service

kubectl expose pod petastorm-ds-creator --port=20022 --type=ClusterIP --cluster-ip=None -n spark


Finally, I run the following code in the created container/pod:

from pyspark import SparkConf
from pyspark.sql import SparkSession

spark_conf = SparkConf()
spark_conf.setMaster("k8s://https://kubernetes.default:443;)
spark_conf.setAppName("PetastormDsCreator")
spark_conf.set(
    "spark.kubernetes.namespace",
    "spark"
)
spark_conf.set(
    "spark.kubernetes.authenticate.driver.serviceAccountName",
    "spark-driver"
)
spark_conf.set(
    "spark.kubernetes.authenticate.caCertFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
)
spark_conf.set(
    "spark.kubernetes.authenticate.oauthTokenFile",
    "/var/run/secrets/kubernetes.io/serviceaccount/token"
)
spark_conf.set(
    "spark.executor.instances",
    "2"
)
spark_conf.set(
    "spark.driver.host",
    "petastorm-ds-creator"
)
spark_conf.set(
    "spark.driver.port",
    "20022"
)
spark_conf.set(
    "spark.kubernetes.container.image",
    "imagename"
)
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
sc = spark.sparkContext
t = sc.parallelize(range(10))
r = t.sumApprox(3)
print('Approximate sum: %s' % r)

Unfortunately, it does not work.

With kubectl describe po podname-exec-1 I get the following error message:

Error: failed to start container "spark-kubernetes-executor": Error 
response from daemon: OCI runtime create failed: container_linux.go:349: 
starting container process caused "exec: \"executor\": executable file 
not found in $PATH": unknown


Could somebody give me a hint as to what I am doing wrong? Is my SparkSession
configuration incorrect?


Best regards

Andreas
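
The "exec: \"executor\": executable file not found in $PATH" message usually means the
image given in spark.kubernetes.container.image is not a Spark-built image: executor
pods are started through Spark's own entrypoint script, which a plain Ubuntu/petastorm
image does not contain. Below is only a hedged sketch of one way to address it, using
the docker-image-tool.sh script that ships with a Spark binary distribution; the
registry, tag, and distribution path are placeholders:

# Build driver/executor images (including the PySpark variant) from an
# unpacked Spark distribution, so the image contains Spark's entrypoint.
cd /opt/spark-3.0.0-bin-hadoop3.2
./bin/docker-image-tool.sh \
  -r <your-registry>/spark \
  -t 3.0.0 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build
./bin/docker-image-tool.sh -r <your-registry>/spark -t 3.0.0 push

# In the driver code above, point the executors at that image, e.g.
#   spark_conf.set("spark.kubernetes.container.image",
#                  "<your-registry>/spark/spark-py:3.0.0")

Since the driver here runs inside the existing petastorm pod (client mode), only the
executor pods need a Spark-built image; spark.kubernetes.executor.container.image can
override the shared setting if the two images should differ.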


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
It uses Helm to deploy the Spark Operator and NGINX. For the other parts, like
creating the EKS cluster, IAM role, node group, etc., it uses the AWS SDK to
provision those AWS resources.
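
As a rough illustration of the Helm part only (the chart repository URL, release, and
namespace names below are assumptions based on the Spark Operator project's own
documentation, not details taken from this thread):

# Add the Spark Operator chart repository (see the operator's README for the
# current URL) and install the operator into its own namespace.
helm repo add spark-operator <spark-operator-chart-repo-url>
helm repo update
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace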

On Wed, Feb 23, 2022 at 11:28 AM Bjørn Jørgensen 
wrote:

> So if I get this right you will make a Helm <https://helm.sh> chart to
> deploy Spark and some other stuff on K8S?
>
On Wed, 23 Feb 2022 at 17:49, bo yang  wrote:
>
>> Hi Sarath, let's follow up offline on this.
>>
>> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy <
>> sarath.annare...@gmail.com> wrote:
>>
>>> Hi bo
>>>
>>> How do we start?
>>>
>>> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>>>
>>>
>>> Thanks
>>> Sarath
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>>>
>>> 
>>> Hi Sarath, thanks for your interest and willing to contribute! The
>>> project supports local development using MiniKube. Similarly there is a one
>>> click command with one extra argument to deploy all components in MiniKube,
>>> and people could use that to develop on their local MacBook.
>>>
>>>
>>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
>>> sarath.annare...@gmail.com> wrote:
>>>
>>>> Hi bo
>>>>
>>>> I am interested to contribute.
>>>> But I don’t have free access to any cloud provider. Not sure how I can
>>>> get free access. I know Google, aws, azure only provides temp free access,
>>>> it may not be sufficient.
>>>>
>>>> Guidance is appreciated.
>>>>
>>>> Sarath
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>>
>>>> 
>>>>
>>>> Right, normally people start with simple script, then add more stuff,
>>>> like permission and more components. After some time, people want to run
>>>> the script consistently in different environments. Things will become
>>>> complex.
>>>>
>>>> That is why we want to see whether people have interest for such a "one
>>>> click" tool to make things easy.
>>>>
>>>>
>>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> There are two distinct actions here; namely Deploy and Run.
>>>>>
>>>>> Deployment can be done by command line script with autoscaling. In the
>>>>> newer versions of Kubernnetes you don't even need to specify the node
>>>>> types, you can leave it to the Kubernetes cluster  to scale up and down 
>>>>> and
>>>>> decide on node type.
>>>>>
>>>>> The second point is the running spark that you will need to submit.
>>>>> However, that depends on setting up access permission, use of service
>>>>> accounts, pulling the correct dockerfiles for the driver and the 
>>>>> executors.
>>>>> Those details add to the complexity.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>>view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>>
>>>>>> Hi Spark Community,
>>>>>>
>>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>>> with a one click command. For example, on AWS, it could automatically
>>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. 
>>>>>> Then
>>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>>> to
>>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>>
>>>>>> Anyone interested in using or working together on such a tool?
>>>>>>
>>>>>> Thanks,
>>>>>> Bo
>>>>>>
>>>>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bjørn Jørgensen
So if I get this right you will make a Helm <https://helm.sh> chart to
deploy Spark and some other stuff on K8S?

On Wed, 23 Feb 2022 at 17:49, bo yang  wrote:

> Hi Sarath, let's follow up offline on this.
>
> On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy <
> sarath.annare...@gmail.com> wrote:
>
>> Hi bo
>>
>> How do we start?
>>
>> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>>
>>
>> Thanks
>> Sarath
>>
>>
>> Sent from my iPhone
>>
>> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>>
>> 
>> Hi Sarath, thanks for your interest and willing to contribute! The
>> project supports local development using MiniKube. Similarly there is a one
>> click command with one extra argument to deploy all components in MiniKube,
>> and people could use that to develop on their local MacBook.
>>
>>
>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
>> sarath.annare...@gmail.com> wrote:
>>
>>> Hi bo
>>>
>>> I am interested to contribute.
>>> But I don’t have free access to any cloud provider. Not sure how I can
>>> get free access. I know Google, aws, azure only provides temp free access,
>>> it may not be sufficient.
>>>
>>> Guidance is appreciated.
>>>
>>> Sarath
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>
>>> 
>>>
>>> Right, normally people start with simple script, then add more stuff,
>>> like permission and more components. After some time, people want to run
>>> the script consistently in different environments. Things will become
>>> complex.
>>>
>>> That is why we want to see whether people have interest for such a "one
>>> click" tool to make things easy.
>>>
>>>
>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> There are two distinct actions here; namely Deploy and Run.
>>>>
>>>> Deployment can be done by command line script with autoscaling. In the
>>>> newer versions of Kubernnetes you don't even need to specify the node
>>>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>>>> decide on node type.
>>>>
>>>> The second point is the running spark that you will need to submit.
>>>> However, that depends on setting up access permission, use of service
>>>> accounts, pulling the correct dockerfiles for the driver and the executors.
>>>> Those details add to the complexity.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>
>>>>> Hi Spark Community,
>>>>>
>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>> with a one click command. For example, on AWS, it could automatically
>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then
>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>> to
>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>
>>>>> Anyone interested in using or working together on such a tool?
>>>>>
>>>>> Thanks,
>>>>> Bo
>>>>>
>>>>>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, let's follow up offline on this.

On Wed, Feb 23, 2022 at 8:32 AM Sarath Annareddy 
wrote:

> Hi bo
>
> How do we start?
>
> Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc
>
>
> Thanks
> Sarath
>
>
> Sent from my iPhone
>
> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
>
> 
> Hi Sarath, thanks for your interest and willing to contribute! The project
> supports local development using MiniKube. Similarly there is a one click
> command with one extra argument to deploy all components in MiniKube, and
> people could use that to develop on their local MacBook.
>
>
> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy <
> sarath.annare...@gmail.com> wrote:
>
>> Hi bo
>>
>> I am interested to contribute.
>> But I don’t have free access to any cloud provider. Not sure how I can
>> get free access. I know Google, aws, azure only provides temp free access,
>> it may not be sufficient.
>>
>> Guidance is appreciated.
>>
>> Sarath
>>
>> Sent from my iPhone
>>
>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>
>> 
>>
>> Right, normally people start with simple script, then add more stuff,
>> like permission and more components. After some time, people want to run
>> the script consistently in different environments. Things will become
>> complex.
>>
>> That is why we want to see whether people have interest for such a "one
>> click" tool to make things easy.
>>
>>
>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> There are two distinct actions here; namely Deploy and Run.
>>>
>>> Deployment can be done by command line script with autoscaling. In the
>>> newer versions of Kubernnetes you don't even need to specify the node
>>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>>> decide on node type.
>>>
>>> The second point is the running spark that you will need to submit.
>>> However, that depends on setting up access permission, use of service
>>> accounts, pulling the correct dockerfiles for the driver and the executors.
>>> Those details add to the complexity.
>>>
>>> Thanks
>>>
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>
>>>> Hi Spark Community,
>>>>
>>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>>> a one click command. For example, on AWS, it could automatically create an
>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>>> be able to use curl or a CLI tool to submit Spark application. After the
>>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>>> Dynamic Allocation on Kuberentes.
>>>>
>>>> Anyone interested in using or working together on such a tool?
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Sarath Annareddy
Hi bo

How do we start?

Is there a plan? Onboarding, Arch/design diagram, tasks lined up etc


Thanks 
Sarath 


Sent from my iPhone

> On Feb 23, 2022, at 10:27 AM, bo yang  wrote:
> 
> 
> Hi Sarath, thanks for your interest and willing to contribute! The project 
> supports local development using MiniKube. Similarly there is a one click 
> command with one extra argument to deploy all components in MiniKube, and 
> people could use that to develop on their local MacBook.
> 
> 
>> On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy 
>>  wrote:
>> Hi bo
>> 
>> I am interested to contribute. 
>> But I don’t have free access to any cloud provider. Not sure how I can get 
>> free access. I know Google, aws, azure only provides temp free access, it 
>> may not be sufficient.
>> 
>> Guidance is appreciated.
>> 
>> Sarath 
>> 
>> Sent from my iPhone
>> 
>>>> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>>>> 
>>> 
>> 
>>> Right, normally people start with simple script, then add more stuff, like 
>>> permission and more components. After some time, people want to run the 
>>> script consistently in different environments. Things will become complex.
>>> 
>>> That is why we want to see whether people have interest for such a "one 
>>> click" tool to make things easy.
>>> 
>>> 
>>>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh 
>>>>  wrote:
>>>> Hi,
>>>> 
>>>> There are two distinct actions here; namely Deploy and Run.
>>>> 
>>>> Deployment can be done by command line script with autoscaling. In the 
>>>> newer versions of Kubernnetes you don't even need to specify the node 
>>>> types, you can leave it to the Kubernetes cluster  to scale up and down 
>>>> and decide on node type.
>>>> 
>>>> The second point is the running spark that you will need to submit. 
>>>> However, that depends on setting up access permission, use of service 
>>>> accounts, pulling the correct dockerfiles for the driver and the 
>>>> executors. Those details add to the complexity.
>>>> 
>>>> Thanks
>>>> 
>>>> 
>>>>view my Linkedin profile
>>>> 
>>>> 
>>>> 
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>> 
>>>>  
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>> loss, damage or destruction of data or any other property which may arise 
>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>> The author will in no case be liable for any monetary damages arising from 
>>>> such loss, damage or destruction.
>>>>  
>>>> 
>>>> 
>>>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>>>> Hi Spark Community,
>>>>> 
>>>>> We built an open source tool to deploy and run Spark on Kubernetes with a 
>>>>> one click command. For example, on AWS, it could automatically create an 
>>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will 
>>>>> be able to use curl or a CLI tool to submit Spark application. After the 
>>>>> deployment, you could also install Uber Remote Shuffle Service to enable 
>>>>> Dynamic Allocation on Kuberentes.
>>>>> 
>>>>> Anyone interested in using or working together on such a tool?
>>>>> 
>>>>> Thanks,
>>>>> Bo
>>>>> 


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Hi Sarath, thanks for your interest and willingness to contribute! The project
supports local development using MiniKube. Similarly there is a one click
command with one extra argument to deploy all components in MiniKube, and
people could use that to develop on their local MacBook.


On Wed, Feb 23, 2022 at 7:41 AM Sarath Annareddy 
wrote:

> Hi bo
>
> I am interested to contribute.
> But I don’t have free access to any cloud provider. Not sure how I can get
> free access. I know Google, aws, azure only provides temp free access, it
> may not be sufficient.
>
> Guidance is appreciated.
>
> Sarath
>
> Sent from my iPhone
>
> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
>
> 
>
> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernnetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Sarath Annareddy
Hi bo

I am interested in contributing.
But I don't have free access to any cloud provider, and I'm not sure how I can get
free access. I know Google, AWS, and Azure only provide temporary free access, which
may not be sufficient.

Guidance is appreciated.

Sarath 

Sent from my iPhone

> On Feb 23, 2022, at 2:01 AM, bo yang  wrote:
> 
> 
> Right, normally people start with simple script, then add more stuff, like 
> permission and more components. After some time, people want to run the 
> script consistently in different environments. Things will become complex.
> 
> That is why we want to see whether people have interest for such a "one 
> click" tool to make things easy.
> 
> 
>> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh  
>> wrote:
>> Hi,
>> 
>> There are two distinct actions here; namely Deploy and Run.
>> 
>> Deployment can be done by command line script with autoscaling. In the newer 
>> versions of Kubernnetes you don't even need to specify the node types, you 
>> can leave it to the Kubernetes cluster  to scale up and down and decide on 
>> node type.
>> 
>> The second point is the running spark that you will need to submit. However, 
>> that depends on setting up access permission, use of service accounts, 
>> pulling the correct dockerfiles for the driver and the executors. Those 
>> details add to the complexity.
>> 
>> Thanks
>> 
>> 
>>view my Linkedin profile
>> 
>> 
>> 
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>  
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>> Hi Spark Community,
>>> 
>>> We built an open source tool to deploy and run Spark on Kubernetes with a 
>>> one click command. For example, on AWS, it could automatically create an 
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will 
>>> be able to use curl or a CLI tool to submit Spark application. After the 
>>> deployment, you could also install Uber Remote Shuffle Service to enable 
>>> Dynamic Allocation on Kuberentes.
>>> 
>>> Anyone interested in using or working together on such a tool?
>>> 
>>> Thanks,
>>> Bo
>>> 


Re: One click to run Spark on Kubernetes

2022-02-23 Thread Bitfox
From my viewpoint, if there is such a pay-as-you-go service I would like to
use it.
Otherwise I have to deploy a regular Spark cluster on GCP/AWS etc., and the
cost is not low.

Thanks.

On Wed, Feb 23, 2022 at 4:00 PM bo yang  wrote:

> Right, normally people start with simple script, then add more stuff, like
> permission and more components. After some time, people want to run the
> script consistently in different environments. Things will become complex.
>
> That is why we want to see whether people have interest for such a "one
> click" tool to make things easy.
>
>
> On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> There are two distinct actions here; namely Deploy and Run.
>>
>> Deployment can be done by command line script with autoscaling. In the
>> newer versions of Kubernnetes you don't even need to specify the node
>> types, you can leave it to the Kubernetes cluster  to scale up and down and
>> decide on node type.
>>
>> The second point is the running spark that you will need to submit.
>> However, that depends on setting up access permission, use of service
>> accounts, pulling the correct dockerfiles for the driver and the executors.
>> Those details add to the complexity.
>>
>> Thanks
>>
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-23 Thread bo yang
Right, normally people start with a simple script, then add more stuff, like
permissions and more components. After some time, people want to run the
script consistently in different environments. Things become complex.

That is why we want to see whether people have interest in such a "one
click" tool to make things easy.


On Tue, Feb 22, 2022 at 11:31 PM Mich Talebzadeh 
wrote:

> Hi,
>
> There are two distinct actions here; namely Deploy and Run.
>
> Deployment can be done by command line script with autoscaling. In the
> newer versions of Kubernnetes you don't even need to specify the node
> types, you can leave it to the Kubernetes cluster  to scale up and down and
> decide on node type.
>
> The second point is the running spark that you will need to submit.
> However, that depends on setting up access permission, use of service
> accounts, pulling the correct dockerfiles for the driver and the executors.
> Those details add to the complexity.
>
> Thanks
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:
>
>> Hi Spark Community,
>>
>> We built an open source tool to deploy and run Spark on Kubernetes with a
>> one click command. For example, on AWS, it could automatically create an
>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>> be able to use curl or a CLI tool to submit Spark application. After the
>> deployment, you could also install Uber Remote Shuffle Service to enable
>> Dynamic Allocation on Kuberentes.
>>
>> Anyone interested in using or working together on such a tool?
>>
>> Thanks,
>> Bo
>>
>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Mich Talebzadeh
Hi,

There are two distinct actions here; namely Deploy and Run.

Deployment can be done by a command line script with autoscaling. In the
newer versions of Kubernetes you don't even need to specify the node
types; you can leave it to the Kubernetes cluster to scale up and down and
decide on the node type.

The second point is running the Spark application that you will need to
submit. However, that depends on setting up access permissions, the use of
service accounts, and pulling the correct Docker images for the driver and
the executors. Those details add to the complexity.

Thanks



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 23 Feb 2022 at 04:06, bo yang  wrote:

> Hi Spark Community,
>
> We built an open source tool to deploy and run Spark on Kubernetes with a
> one click command. For example, on AWS, it could automatically create an
> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
> be able to use curl or a CLI tool to submit Spark application. After the
> deployment, you could also install Uber Remote Shuffle Service to enable
> Dynamic Allocation on Kuberentes.
>
> Anyone interested in using or working together on such a tool?
>
> Thanks,
> Bo
>
>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Merging in another email from Prasad. It could co-exist with Livy. Livy is
similar to the REST Service + Spark Operator combination. Unfortunately,
Livy is not very active right now.

To Amihay, the link is: https://github.com/datapunchorg/punch.

On Tue, Feb 22, 2022 at 8:53 PM amihay gonen  wrote:

> Can you share link to the source?
>
On Wed, 23 Feb 2022 at 6:52, bo yang wrote:
>
>> We do not have SaaS yet. Now it is an open source project we build in our
>> part time , and we welcome more people working together on that.
>>
>> You could specify cluster size (EC2 instance type and number of
>> instances) and run it for 1 hour. Then you could run one click command to
>> destroy the cluster. It is possible to merge these steps as well, and
>> provide a "serverless" experience. That is in our TODO list :)
>>
>>
>> On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:
>>
>>> How can I specify the cluster memory and cores?
>>> For instance, I want to run a job with 16 cores and 300 GB memory for
>>> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>>>
>>> Thanks
>>>
>>> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>>>
>>>> It is not a standalone spark cluster. In some details, it deploys a
>>>> Spark Operator (
>>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and an
>>>> extra REST Service. When people submit Spark application to that REST
>>>> Service, the REST Service will create a CRD inside the Kubernetes cluster.
>>>> Then Spark Operator will pick up the CRD and launch the Spark application.
>>>> The one click tool intends to hide these details, so people could just
>>>> submit Spark and do not need to deal with too many deployment details.
>>>>
>>>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>>>
>>>>> Can it be a cluster installation of spark? or just the standalone node?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>>>
>>>>>> Hi Spark Community,
>>>>>>
>>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>>> with a one click command. For example, on AWS, it could automatically
>>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. 
>>>>>> Then
>>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>>> to
>>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>>
>>>>>> Anyone interested in using or working together on such a tool?
>>>>>>
>>>>>> Thanks,
>>>>>> Bo
>>>>>>
>>>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread amihay gonen
Can you share link to the source?

On Wed, 23 Feb 2022 at 6:52, bo yang wrote:

> We do not have SaaS yet. Now it is an open source project we build in our
> part time , and we welcome more people working together on that.
>
> You could specify cluster size (EC2 instance type and number of instances)
> and run it for 1 hour. Then you could run one click command to destroy the
> cluster. It is possible to merge these steps as well, and provide a
> "serverless" experience. That is in our TODO list :)
>
>
> On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:
>
>> How can I specify the cluster memory and cores?
>> For instance, I want to run a job with 16 cores and 300 GB memory for
>> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>>
>>> It is not a standalone spark cluster. In some details, it deploys a
>>> Spark Operator (
>>> https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and an
>>> extra REST Service. When people submit Spark application to that REST
>>> Service, the REST Service will create a CRD inside the Kubernetes cluster.
>>> Then Spark Operator will pick up the CRD and launch the Spark application.
>>> The one click tool intends to hide these details, so people could just
>>> submit Spark and do not need to deal with too many deployment details.
>>>
>>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>>
>>>> Can it be a cluster installation of spark? or just the standalone node?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>>
>>>>> Hi Spark Community,
>>>>>
>>>>> We built an open source tool to deploy and run Spark on Kubernetes
>>>>> with a one click command. For example, on AWS, it could automatically
>>>>> create an EKS cluster, node group, NGINX ingress, and Spark Operator. Then
>>>>> you will be able to use curl or a CLI tool to submit Spark application.
>>>>> After the deployment, you could also install Uber Remote Shuffle Service 
>>>>> to
>>>>> enable Dynamic Allocation on Kuberentes.
>>>>>
>>>>> Anyone interested in using or working together on such a tool?
>>>>>
>>>>> Thanks,
>>>>> Bo
>>>>>
>>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
We do not have a SaaS offering yet. Right now it is an open source project we
build in our spare time, and we welcome more people working together on it.

You could specify cluster size (EC2 instance type and number of instances)
and run it for 1 hour. Then you could run one click command to destroy the
cluster. It is possible to merge these steps as well, and provide a
"serverless" experience. That is in our TODO list :)


On Tue, Feb 22, 2022 at 8:36 PM Bitfox  wrote:

> How can I specify the cluster memory and cores?
> For instance, I want to run a job with 16 cores and 300 GB memory for
> about 1 hour. Do you have the SaaS solution for this? I can pay as I did.
>
> Thanks
>
> On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:
>
>> It is not a standalone spark cluster. In some details, it deploys a Spark
>> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
>> and an extra REST Service. When people submit Spark application to that
>> REST Service, the REST Service will create a CRD inside the
>> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
>> Spark application. The one click tool intends to hide these details, so
>> people could just submit Spark and do not need to deal with too many
>> deployment details.
>>
>> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>>
>>> Can it be a cluster installation of spark? or just the standalone node?
>>>
>>> Thanks
>>>
>>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>>
>>>> Hi Spark Community,
>>>>
>>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>>> a one click command. For example, on AWS, it could automatically create an
>>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>>> be able to use curl or a CLI tool to submit Spark application. After the
>>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>>> Dynamic Allocation on Kuberentes.
>>>>
>>>> Anyone interested in using or working together on such a tool?
>>>>
>>>> Thanks,
>>>> Bo
>>>>
>>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Prasad Paravatha
Hi Bo Yang,
Would it be something along the lines of Apache Livy?

Thanks,
Prasad


On Tue, Feb 22, 2022 at 10:22 PM bo yang  wrote:

> It is not a standalone spark cluster. In some details, it deploys a Spark
> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
> and an extra REST Service. When people submit Spark application to that
> REST Service, the REST Service will create a CRD inside the
> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
> Spark application. The one click tool intends to hide these details, so
> people could just submit Spark and do not need to deal with too many
> deployment details.
>
> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>
>> Can it be a cluster installation of spark? or just the standalone node?
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>

-- 
Regards,
Prasad Paravatha


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
How can I specify the cluster memory and cores?
For instance, I want to run a job with 16 cores and 300 GB memory for about
1 hour. Do you have the SaaS solution for this? I can pay as I did.

Thanks

On Wed, Feb 23, 2022 at 12:21 PM bo yang  wrote:

> It is not a standalone spark cluster. In some details, it deploys a Spark
> Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
> and an extra REST Service. When people submit Spark application to that
> REST Service, the REST Service will create a CRD inside the
> Kubernetes cluster. Then Spark Operator will pick up the CRD and launch the
> Spark application. The one click tool intends to hide these details, so
> people could just submit Spark and do not need to deal with too many
> deployment details.
>
> On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:
>
>> Can it be a cluster installation of spark? or just the standalone node?
>>
>> Thanks
>>
>> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>>
>>> Hi Spark Community,
>>>
>>> We built an open source tool to deploy and run Spark on Kubernetes with
>>> a one click command. For example, on AWS, it could automatically create an
>>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>>> be able to use curl or a CLI tool to submit Spark application. After the
>>> deployment, you could also install Uber Remote Shuffle Service to enable
>>> Dynamic Allocation on Kuberentes.
>>>
>>> Anyone interested in using or working together on such a tool?
>>>
>>> Thanks,
>>> Bo
>>>
>>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
It is not a standalone Spark cluster. In more detail, it deploys a Spark
Operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) and
an extra REST Service. When people submit a Spark application to that REST
Service, the REST Service creates a custom resource inside the Kubernetes
cluster. The Spark Operator then picks up that resource and launches the
Spark application. The one click tool intends to hide these details, so
people can just submit Spark jobs and do not need to deal with too many
deployment details.
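
To make that flow concrete, below is only a hedged sketch of the kind of custom
resource the operator consumes; field values such as the image, jar path, versions,
and service account are placeholders taken from the operator's public examples, not
from this tool:

# A SparkApplication custom resource; the Spark Operator watches for these
# objects and translates them into driver/executor pods.
kubectl apply -f - <<'EOF'
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "<your-spark-image>"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"
  sparkVersion: "3.1.1"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"
EOF

In this setup the REST Service is what writes such an object on the user's behalf, so
the user only sends a plain submit request.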

On Tue, Feb 22, 2022 at 8:09 PM Bitfox  wrote:

> Can it be a cluster installation of spark? or just the standalone node?
>
> Thanks
>
> On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:
>
>> Hi Spark Community,
>>
>> We built an open source tool to deploy and run Spark on Kubernetes with a
>> one click command. For example, on AWS, it could automatically create an
>> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
>> be able to use curl or a CLI tool to submit Spark application. After the
>> deployment, you could also install Uber Remote Shuffle Service to enable
>> Dynamic Allocation on Kuberentes.
>>
>> Anyone interested in using or working together on such a tool?
>>
>> Thanks,
>> Bo
>>
>>


Re: One click to run Spark on Kubernetes

2022-02-22 Thread Bitfox
Can it be a cluster installation of spark? or just the standalone node?

Thanks

On Wed, Feb 23, 2022 at 12:06 PM bo yang  wrote:

> Hi Spark Community,
>
> We built an open source tool to deploy and run Spark on Kubernetes with a
> one click command. For example, on AWS, it could automatically create an
> EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
> be able to use curl or a CLI tool to submit Spark application. After the
> deployment, you could also install Uber Remote Shuffle Service to enable
> Dynamic Allocation on Kuberentes.
>
> Anyone interested in using or working together on such a tool?
>
> Thanks,
> Bo
>
>


One click to run Spark on Kubernetes

2022-02-22 Thread bo yang
Hi Spark Community,

We built an open source tool to deploy and run Spark on Kubernetes with a
one click command. For example, on AWS, it could automatically create an
EKS cluster, node group, NGINX ingress, and Spark Operator. Then you will
be able to use curl or a CLI tool to submit a Spark application. After the
deployment, you could also install the Uber Remote Shuffle Service to enable
Dynamic Allocation on Kubernetes.

Anyone interested in using or working together on such a tool?

Thanks,
Bo


Shuffle in Spark with Kubernetes

2021-10-27 Thread Mich Talebzadeh
As I understand it, Spark releases > 3 currently do not support an external
shuffle service on Kubernetes. Are there any timelines for when this could be
available?

For now we have two parameters for Dynamic Resource Allocation. These are

 --conf spark.dynamicAllocation.enabled=true \
 --conf spark.dynamicAllocation.shuffleTracking.enabled=true \


The idea is to use dynamic resource allocation, where the driver tracks the
shuffle files and evicts only executors not storing active shuffle files.
So, in a nutshell, these shuffle files are stored in the executors themselves
in the absence of an external shuffle service. The model works on the basis
of the "one-container-per-pod" model, meaning that one node of the cluster
runs the driver and each remaining node runs one executor. If I
over-provision my GKE cluster, for example by adding one redundant node and
increasing the number of executors by one, it should improve the latency.
Have there been any benchmarks on this feature?

Thanks



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Mich Talebzadeh
Splendid.

Please invite me to the next meeting

mich.talebza...@gmail.com

Timezone London, UK  *GMT+1*

Thanks,


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 8 Jul 2021 at 19:04, Holden Karau  wrote:

> Hi Y'all,
>
> We had an initial meeting which went well, got some more context around
> Volcano and its near-term roadmap. Talked about the impact around scheduler
> deadlocking and some ways that we could potentially improve integration
> from the Spark side and Volcano sides respectively. I'm going to start
> creating some sub-issues under
> https://issues.apache.org/jira/browse/SPARK-36057
>
> If anyone is interested in being on the next meeting please reach out and
> I'll send an e-mail around to try and schedule re-occurring sync that works
> for folks.
>
> Cheers,
>
> Holden
>
> On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:
>
>> That's awesome, I'm just starting to get context around Volcano but maybe
>> we can schedule an initial meeting for all of us interested in pursuing
>> this to get on the same page.
>>
>> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>>
>>> Hi team,
>>>
>>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>>> community also has such requirements :)
>>>
>>> Volcano provides several features for batch workload, e.g. fair-share,
>>> queue, reservation, preemption/reclaim and so on.
>>> It has been used in several product environments with Spark; if
>>> necessary, I can give an overall introduction about Volcano's features and
>>> those use cases :)
>>>
>>> -- Klaus
>>>
>>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> Please allow me to be diverse and express a different point of view on
>>>> this roadmap.
>>>>
>>>>
>>>> I believe from a technical point of view spending time and effort plus
>>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>>> may say I doubt whether such an approach and the so-called democratization
>>>> of Spark on whatever platform is really should be of great focus.
>>>>
>>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A 
>>>> fully
>>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>>> more recently other artefacts) for that past two years, and Spark on
>>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>>> beast that that one can fully commoditize it much like one can do with
>>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>>> effortlessly on these commercial platforms with whatever as a Service.
>>>>
>>>>
>>>> Moreover, Spark (and I stand corrected) from the ground up has already
>>>> a lot of resiliency and redundancy built in. It is truly an enterprise
>>>> class product (requires enterprise class support) that will be difficult to
>>>> commoditize with Kubernetes and expect the same performance. After all,
>>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>>> for the mass market. In short I can see commercial enterprises will work on
>>>> these platforms ,but may be the great talents on dev team should focus on
>>>> stuff like the perceived limitation of SSS in dealing with chain of
>>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>>
>>>>
>>>> These are my opinions and they are not facts, just opinions so to speak
>>>> :)
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying 

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
Hi Y'all,

We had an initial meeting which went well, got some more context around
Volcano and its near-term roadmap. Talked about the impact around scheduler
deadlocking and some ways that we could potentially improve integration
from the Spark side and Volcano sides respectively. I'm going to start
creating some sub-issues under
https://issues.apache.org/jira/browse/SPARK-36057

If anyone is interested in being on the next meeting please reach out and
I'll send an e-mail around to try and schedule re-occurring sync that works
for folks.

Cheers,

Holden
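
For anyone experimenting before deeper integration lands, one low-touch way to hand
the driver and executor pods to a batch scheduler such as Volcano is through pod
template files (available since Spark 3.0). This is only a sketch under the assumption
that the scheduler is already installed in the cluster; it does not by itself provide
gang or queue semantics, which is exactly the integration work discussed above. File
names, the scheduler name, image, and API server address are placeholders:

# Pod template that asks Kubernetes to use a non-default scheduler.
cat > /tmp/batch-scheduler-pod-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
spec:
  schedulerName: volcano
EOF

./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:443 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.driver.podTemplateFile=/tmp/batch-scheduler-pod-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=/tmp/batch-scheduler-pod-template.yaml \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar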

On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:

> That's awesome, I'm just starting to get context around Volcano but maybe
> we can schedule an initial meeting for all of us interested in pursuing
> this to get on the same page.
>
> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>
>> Hi team,
>>
>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>> community also has such requirements :)
>>
>> Volcano provides several features for batch workload, e.g. fair-share,
>> queue, reservation, preemption/reclaim and so on.
>> It has been used in several product environments with Spark; if
>> necessary, I can give an overall introduction about Volcano's features and
>> those use cases :)
>>
>> -- Klaus
>>
>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say I doubt whether such an approach and the so-called democratization
>>> of Spark on whatever platform is really should be of great focus.
>>>
>>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A 
>>> fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>> more recently other artefacts) for that past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that that one can fully commoditize it much like one can do with
>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>> effortlessly on these commercial platforms with whatever as a Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) from the ground up has already a
>>> lot of resiliency and redundancy built in. It is truly an enterprise class
>>> product (requires enterprise class support) that will be difficult to
>>> commoditize with Kubernetes and expect the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>> for the mass market. In short I can see commercial enterprises will work on
>>> these platforms ,but may be the great talents on dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chain of
>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
>>>> I think these approaches are good, but there are limitations (eg
>>>> dynamic scaling) without us making changes inside of the Spark Kube
>>>> scheduler.
>>>>
>>>> Certainly whichever scheduler extensions we add support for we should
>>>> collaborate with the people developing those extensions insofar as they are
>>>> interested. My first place that I checked was #sig-scheduling which is
>>>> fairly quite on the Kubernetes slack but if there are more places to look
>>>> for folks interested in batch scheduling

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Hi Holden,

Thank you for your points. I guess, coming from a corporate world, I had
overlooked how an open source project like Spark leverages resources and
interest :).

As @KlausMa kindly volunteered, it would be good to hear scheduling ideas for
Spark on Kubernetes, and of course, since I am sure you have some
inroads/ideas on this subject as well, truly I guess love would be in the air
for Kubernetes <https://www.youtube.com/watch?v=NNC0kIzM1Fo>

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 16:59, Holden Karau  wrote:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people I don't hear about new standalone, YARN or mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but its important to remember that the Spark developers are not
> all fully interchangeable, we work on the things that we're interested in
> pursuing so even if structured streaming needs more love if I'm not super
> interested in structured streaming I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> in

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome, I'm just starting to get context around Volcano but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>>> https://github.com/volcano-sh/volcano
>>>>
>>> <https://github.com/volcano-sh/volcano>
>>>>
>>>> There may be value-add in collaborating with such groups through CNCF
>>>> in order to have a c

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Lalwani, Jayesh
You can always chain aggregations by chaining multiple Structured Streaming 
jobs. It’s not a showstopper.

Getting Spark on Kubernetes is important for organizations that want to pursue 
a multi-cloud strategy
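
For illustration, a minimal Scala sketch of what chaining aggregations across multiple Structured Streaming queries can look like: the first query writes its aggregation to an intermediate Kafka topic and a second query aggregates that output again. The broker address, topic names and checkpoint paths are assumptions, and the spark-sql-kafka connector is assumed to be on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("chained-agg-sketch").getOrCreate()
import spark.implicits._

// Query 1: aggregate raw events and publish the result to an intermediate topic.
val firstAgg = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")   // assumed broker
  .option("subscribe", "events")                     // assumed input topic
  .load()
  .selectExpr("CAST(value AS STRING) AS k")
  .groupBy($"k").count()
  .selectExpr("k AS key", "CAST(`count` AS STRING) AS value")

firstAgg.writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("topic", "events-agg")                     // assumed intermediate topic
  .option("checkpointLocation", "/tmp/chk-agg1")
  .outputMode("update")
  .start()

// Query 2: read the first aggregation back and aggregate it again.
val secondAgg = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events-agg")
  .load()
  .selectExpr("CAST(key AS STRING) AS k",
              "CAST(CAST(value AS STRING) AS LONG) AS cnt")
  .groupBy($"k").agg(sum($"cnt").as("running_total"))

secondAgg.writeStream.format("console")
  .option("checkpointLocation", "/tmp/chk-agg2")
  .outputMode("complete")
  .start()

spark.streams.awaitAnyTermination()

The price of this pattern is the extra hop through the intermediate topic and a second set of checkpoints, but it does work around the single-aggregation limit within one streaming query.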

From: Mich Talebzadeh 
Date: Wednesday, June 23, 2021 at 11:27 AM
To: "user @spark" 
Cc: dev 
Subject: RE: [EXTERNAL] Spark on Kubernetes scheduler variety






Please allow me to be diverse and express a different point of view on this 
roadmap.

I believe from a technical point of view spending time and effort plus talent 
on batch scheduling on Kubernetes could be rewarding. However, if I may say I 
doubt whether such an approach and the so-called democratization of Spark on 
whatever platform is really should be of great focus.
Having worked on Google Dataproc<https://cloud.google.com/dataproc> (A fully 
managed and highly scalable service for running Apache Spark, Hadoop and more 
recently other artefacts) for that past two years, and Spark on Kubernetes 
on-premise, I have come to the conclusion that Spark is not a beast that that 
one can fully commoditize it much like one can do with  Zookeeper, Kafka etc. 
There is always a struggle to make some niche areas of Spark like Spark 
Structured Streaming (SSS) work seamlessly and effortlessly on these commercial 
platforms with whatever as a Service.

Moreover, Spark (and I stand corrected) from the ground up has already a lot of 
resiliency and redundancy built in. It is truly an enterprise class product 
(requires enterprise class support) that will be difficult to commoditize with 
Kubernetes and expect the same performance. After all, Kubernetes is aimed at 
efficient resource sharing and potential cost saving for the mass market. In 
short I can see commercial enterprises will work on these platforms ,but may be 
the great talents on dev team should focus on stuff like the perceived 
limitation of SSS in dealing with chain of aggregation( if I am correct it is 
not yet supported on streaming datasets)

These are my opinions and they are not facts, just opinions so to speak :)

   view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Jun 2021 at 23:18, Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I think these approaches are good, but there are limitations (eg dynamic 
scaling) without us making changes inside of the Spark Kube scheduler.

Certainly whichever scheduler extensions we add support for we should 
collaborate with the people developing those extensions insofar as they are 
interested. My first place that I checked was #sig-scheduling which is fairly 
quiet on the Kubernetes Slack but if there are more places to look for folks 
interested in batch scheduling on Kubernetes we should definitely give it a 
shot :)

On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

Regarding your point and I quote

"..  I know that one of the Spark on Kube operators supports volcano/kube-batch 
so I was thinking that might be a place I would start exploring..."

There seems to be ongoing work on say Volcano as part of  Cloud Native 
Computing Foundation<https://cncf.io/> (CNCF). For example through 
https://github.com/volcano-sh/volcano

There may be value-add in collaborating with such groups through CNCF in order 
to have a collective approach to such work. There also seems to be some work on 
Integration of Spark with Volcano for Batch 
Scheduling.<https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>



What is not very clear is the degree of progress of these projects. You may be 
kind enough to elaborate on KPI for each of these projects and where you think 
your contributions is going to be.



HTH,



Mich


   view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Jun 2021 at 00:44, Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
Hi Folks,

I'm continuing my adventures to make Spark on containers party a

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Klaus! I am interested in more details.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>>> https://github.com/volcano-sh/volcano
>>>>
>>> <https://github.com/volcano-sh/volcano>
>>>>
>>>> There may be value-add in collaborating with such groups through CNCF
>>>> in order to have a collective approach to such work. There also seems to be
>>>> some work on Integration of Spark with Volcano for Ba

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great.

It would also be helpful if you could elaborate on the need for these
features in light of the limitations of current batch workload scheduling.

Regards,

Mich



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 02:53, Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Regarding your point and I quote
>>>>
>>>> "..  I know that one of the Spark on Kube operators
>>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>>> start exploring..."
>>>>
>>>&

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Klaus Ma
Hi team,

I'm the kube-batch/Volcano founder, and I'm excited to hear that the Spark
community also has such requirements :)

Volcano provides several features for batch workloads, e.g. fair-share,
queues, reservation, preemption/reclaim and so on.
It has been used in several production environments with Spark; if necessary,
I can give an overall introduction to Volcano's features and those use
cases :)

-- Klaus

On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh 
wrote:

>
>
> Please allow me to be diverse and express a different point of view on
> this roadmap.
>
>
> I believe from a technical point of view spending time and effort plus
> talent on batch scheduling on Kubernetes could be rewarding. However, if I
> may say I doubt whether such an approach and the so-called democratization
> of Spark on whatever platform is really should be of great focus.
>
> Having worked on Google Dataproc <https://cloud.google.com/dataproc> (A fully
> managed and highly scalable service for running Apache Spark, Hadoop and
> more recently other artefacts) for that past two years, and Spark on
> Kubernetes on-premise, I have come to the conclusion that Spark is not a
> beast that that one can fully commoditize it much like one can do with
> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
> of Spark like Spark Structured Streaming (SSS) work seamlessly and
> effortlessly on these commercial platforms with whatever as a Service.
>
>
> Moreover, Spark (and I stand corrected) from the ground up has already a
> lot of resiliency and redundancy built in. It is truly an enterprise class
> product (requires enterprise class support) that will be difficult to
> commoditize with Kubernetes and expect the same performance. After all,
> Kubernetes is aimed at efficient resource sharing and potential cost saving
> for the mass market. In short I can see commercial enterprises will work on
> these platforms ,but may be the great talents on dev team should focus on
> stuff like the perceived limitation of SSS in dealing with chain of
> aggregation( if I am correct it is not yet supported on streaming datasets)
>
>
> These are my opinions and they are not facts, just opinions so to speak :)
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>
>> I think these approaches are good, but there are limitations (eg dynamic
>> scaling) without us making changes inside of the Spark Kube scheduler.
>>
>> Certainly whichever scheduler extensions we add support for we should
>> collaborate with the people developing those extensions insofar as they are
>> interested. My first place that I checked was #sig-scheduling which is
>> fairly quite on the Kubernetes slack but if there are more places to look
>> for folks interested in batch scheduling on Kubernetes we should definitely
>> give it a shot :)
>>
>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Regarding your point and I quote
>>>
>>> "..  I know that one of the Spark on Kube operators
>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>> start exploring..."
>>>
>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>>> https://github.com/volcano-sh/volcano
>>>
>> <https://github.com/volcano-sh/volcano>
>>>
>>> There may be value-add in collaborating with such groups through CNCF in
>>> order to have a collective approach to such work. There also seems to be
>>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>>
>>>
>>>
>>> What is not very clear is the degree of progress of these projects. You
>>> may be kind enough to elaborate on KPI for each of these projects and where
>>> you think your contributions is going to be.
>>>
>>>
>>> HTH,
>>>
>>>
>>> Mich
>>>
>>>
>>&

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Mich Talebzadeh
Please allow me to be diverse and express a different point of view on
this roadmap.


I believe from a technical point of view that spending time, effort and
talent on batch scheduling on Kubernetes could be rewarding. However, if I
may say so, I doubt whether such an approach and the so-called democratization
of Spark on whatever platform really should be of great focus.

Having worked on Google Dataproc <https://cloud.google.com/dataproc> (a fully
managed and highly scalable service for running Apache Spark, Hadoop and
more recently other artefacts) for the past two years, and on Spark on
Kubernetes on-premise, I have come to the conclusion that Spark is not a
beast that one can fully commoditize much like one can do with
Zookeeper, Kafka etc. There is always a struggle to make some niche areas
of Spark, like Spark Structured Streaming (SSS), work seamlessly and
effortlessly on these commercial whatever-as-a-Service platforms.


Moreover, Spark (and I stand to be corrected) from the ground up already has a
lot of resiliency and redundancy built in. It is truly an enterprise class
product (requiring enterprise class support) that will be difficult to
commoditize with Kubernetes and expect the same performance. After all,
Kubernetes is aimed at efficient resource sharing and potential cost saving
for the mass market. In short, I can see commercial enterprises working on
these platforms, but maybe the great talents on the dev team should focus on
stuff like the perceived limitation of SSS in dealing with chains of
aggregation (if I am correct, it is not yet supported on streaming datasets).


These are my opinions and they are not facts, just opinions so to speak :)


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:

> I think these approaches are good, but there are limitations (eg dynamic
> scaling) without us making changes inside of the Spark Kube scheduler.
>
> Certainly whichever scheduler extensions we add support for we should
> collaborate with the people developing those extensions insofar as they are
> interested. My first place that I checked was #sig-scheduling which is
> fairly quite on the Kubernetes slack but if there are more places to look
> for folks interested in batch scheduling on Kubernetes we should definitely
> give it a shot :)
>
> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Regarding your point and I quote
>>
>> "..  I know that one of the Spark on Kube operators
>> supports volcano/kube-batch so I was thinking that might be a place I would
>> start exploring..."
>>
>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>> Computing Foundation <https://cncf.io/> (CNCF). For example through
>> https://github.com/volcano-sh/volcano
>>
> <https://github.com/volcano-sh/volcano>
>>
>> There may be value-add in collaborating with such groups through CNCF in
>> order to have a collective approach to such work. There also seems to be
>> some work on Integration of Spark with Volcano for Batch Scheduling.
>> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>
>>
>>
>>
>> What is not very clear is the degree of progress of these projects. You
>> may be kind enough to elaborate on KPI for each of these projects and where
>> you think your contributions is going to be.
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 00:44, Holden Karau  wrote:
>>
>>> Hi Folks,
>>>
>>> I'm continuing my adventures to make Spark on containers party and I
>>> was wondering if folks have experience with the different batch
>>> scheduler options that they prefer? I was thinking so that we can
>>> better support dynamic al

Re: Question on spark on Kubernetes

2021-05-20 Thread Gourav Sengupta
Hi Mithalee,
Let's start with why: why are you using Kubernetes and not just EMR on EC2?

Do you have extremely bespoke library dependencies and requirements? Or
do your workloads fail if the clusters do not scale up or down within a
few minutes?


Regards,
Gourav Sengupta

On Thu, May 20, 2021 at 9:50 PM Mithalee Mohapatra <
mithaleemohapa...@gmail.com> wrote:

> Hi,
> I am currently trying to run spark submit in Kubernetes. I have set up the
> IAM roles for serviceaccount and generated the ARN. I am trying to use the
> "spark.hadoop.fs.s3a.fast.upload=true --conf
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
> but getting the below error. Do I need to create a token file. What will be
> the content of the token file and how can I deploy it in the cluster.
> [image: image.png]
>


Question on spark on Kubernetes

2021-05-20 Thread Mithalee Mohapatra
Hi,
I am currently trying to run spark-submit in Kubernetes. I have set up the
IAM roles for service accounts and generated the ARN. I am trying to use
"spark.hadoop.fs.s3a.fast.upload=true --conf
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
but am getting the below error. Do I need to create a token file? What will be
the content of the token file, and how can I deploy it in the cluster?
[image: image.png]
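
For what it's worth, with IAM Roles for Service Accounts (IRSA) on EKS the token file is normally not something you create yourself: annotating the driver/executor service account with the role ARN makes EKS mount a projected web identity token and inject AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into the containers, which is what WebIdentityTokenCredentialsProvider reads. A hedged Scala sketch of the Spark side; the role, bucket and paths are made up, and hadoop-aws plus the AWS SDK are assumed to be on the classpath.

import org.apache.spark.sql.SparkSession

// Assumes the driver and executor pods run under a service account annotated with
//   eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-spark-role   (made-up role)
// so that EKS injects AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE and the projected token.
val spark = SparkSession.builder()
  .appName("s3a-irsa-sketch")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
          "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .getOrCreate()

// Sanity check: the webhook should have injected the token file reference into this container.
println(sys.env.get("AWS_WEB_IDENTITY_TOKEN_FILE"))

// Hypothetical bucket/prefix, only to exercise the credentials path.
spark.read.text("s3a://my-bucket/some/prefix/").show(5)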


Re: [Spark in Kubernetes] Question about running in client mode

2021-04-27 Thread Shiqi Sun
Hi Attila,

Ah that makes sense. Thanks for the clarification!

Best,
Shiqi

On Mon, Apr 26, 2021 at 8:09 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Shiqi,
>
> In case of client mode the driver runs locally: in the same machine, even
> in the same process, of the spark submit.
>
> So if the application was submitted in a running POD then the driver will
> be running in a POD and when outside of K8s then it will be running
> outside.
> This is why there is no config mentioned for this.
>
> From the deploy mode in general you can read here:
> https://spark.apache.org/docs/latest/submitting-applications.html
>
> Best Regards,
> Attila
>
> On Tue, Apr 27, 2021 at 12:03 AM Shiqi Sun  wrote:
>
>> Hi Spark User group,
>>
>> I have a couple of quick questions about running Spark in Kubernetes
>> between different deploy modes.
>>
>> As specified in
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
>> since Spark 2.4, client mode support is available when running in
>> Kubernetes, and it says "when your application runs in client mode, the
>> driver can run inside a pod or on a physical host". Then here come the
>> questions.
>>
>> 1. If I understand correctly, in cluster mode, the driver is also running
>> inside a k8s pod. Then, what's the difference between running it in cluster
>> mode, versus running it in client mode when I choose to run my driver in a
>> pod?
>>
>> 2. What does it mean by "running driver on a physical host"? Does it mean
>> that it runs outside of the k8s cluster? What config should I pass to spark
>> submit so that it runs this way, instead of running my driver into a k8s
>> pod?
>>
>> Thanks!
>>
>> Best,
>> Shiqi
>>
>


Re: [Spark in Kubernetes] Question about running in client mode

2021-04-26 Thread Attila Zsolt Piros
Hi Shiqi,

In case of client mode the driver runs locally: on the same machine, even
in the same process, as the spark-submit.

So if the application was submitted from a running pod then the driver will
be running in that pod, and when submitted from outside of K8s it will be
running outside.
This is why there is no config mentioned for this.

About deploy modes in general you can read here:
https://spark.apache.org/docs/latest/submitting-applications.html

Best Regards,
Attila

On Tue, Apr 27, 2021 at 12:03 AM Shiqi Sun  wrote:

> Hi Spark User group,
>
> I have a couple of quick questions about running Spark in Kubernetes
> between different deploy modes.
>
> As specified in
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
> since Spark 2.4, client mode support is available when running in
> Kubernetes, and it says "when your application runs in client mode, the
> driver can run inside a pod or on a physical host". Then here come the
> questions.
>
> 1. If I understand correctly, in cluster mode, the driver is also running
> inside a k8s pod. Then, what's the difference between running it in cluster
> mode, versus running it in client mode when I choose to run my driver in a
> pod?
>
> 2. What does it mean by "running driver on a physical host"? Does it mean
> that it runs outside of the k8s cluster? What config should I pass to spark
> submit so that it runs this way, instead of running my driver into a k8s
> pod?
>
> Thanks!
>
> Best,
> Shiqi
>


[Spark in Kubernetes] Question about running in client mode

2021-04-26 Thread Shiqi Sun
Hi Spark User group,

I have a couple of quick questions about running Spark in Kubernetes
between different deploy modes.

As specified in
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
since Spark 2.4, client mode support is available when running in
Kubernetes, and it says "when your application runs in client mode, the
driver can run inside a pod or on a physical host". Then here come the
questions.

1. If I understand correctly, in cluster mode, the driver is also running
inside a k8s pod. Then, what's the difference between running it in cluster
mode, versus running it in client mode when I choose to run my driver in a
pod?

2. What does "running the driver on a physical host" mean? Does it mean
that it runs outside of the k8s cluster? What config should I pass to
spark-submit so that it runs this way, instead of running my driver in a k8s
pod?

Thanks!

Best,
Shiqi
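
To make the client-mode case concrete (the driver is simply the JVM you start, wherever it runs), a hedged Scala sketch; the API server URL, namespace and image are assumptions, and in client mode the executors must also be able to reach the driver (spark.driver.host, typically via a headless service when the driver itself runs in a pod).

import org.apache.spark.sql.SparkSession

// Client mode on K8s: this process is the driver; only executors are created as pods.
// Run it inside a pod and the driver lives in that pod; run it on a physical host or
// laptop outside the cluster and the driver lives there.
val spark = SparkSession.builder()
  .appName("client-mode-sketch")
  .master("k8s://https://kubernetes.default.svc:443")                 // assumed API server address
  .config("spark.submit.deployMode", "client")
  .config("spark.kubernetes.namespace", "spark")                      // assumed namespace
  .config("spark.kubernetes.container.image", "my-repo/spark:3.1.1")  // assumed executor image
  .config("spark.executor.instances", "2")
  .getOrCreate()

println(spark.range(1000).selectExpr("sum(id)").collect().mkString)
spark.stop()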


Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-10 Thread ranju goel
Hi Attila,

Thanks for your guidance on how to use dynamic allocation effectively for a
Spark job. Now I am a bit more confident about setting the
schedulerBacklogTimeout wisely.

Regarding your statement: "If there is no more new resource available for
Spark then the existing ones will be used (even the min executors count is
not guaranteed to be reached if no resources are available)."

*We discussed this earlier on the Spark JIRA, where I wanted to kill the job
if no resources were available. I finally succeeded in killing the job with
the help of the Spark listener event bus, where I use spark.extraListeners
and get a notification when executors get added during SparkContext
initialization.*
*If I am not notified within roughly 10 seconds at the time of SparkContext
initialization, then I assume the resources are unavailable and
subsequently kill the job.*

*I saw a similar property in the Spark documentation:*

spark.scheduler.maxRegisteredResourcesWaitingTime  30s  *Maximum amount of
time to wait for resources to register before scheduling begins.*

*Does this property mean that the scheduler will wait only 30 seconds, and if
resources are still not registered, the scheduler will give up on this job
while the executors remain in a Pending state? Or does this property do more?*

*Best Regards*
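
For reference, a minimal Scala sketch of the kind of listener described above; the package and class names are made up, it would need to be compiled into a jar on the driver classpath, and it only illustrates the onExecutorAdded hook rather than the actual code used. As far as I know, spark.scheduler.maxRegisteredResourcesWaitingTime only bounds how long the scheduler waits before it starts scheduling on whatever resources have registered; it does not cancel the job by itself.

package com.example.monitoring  // hypothetical package

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

// Wire it in with: --conf spark.extraListeners=com.example.monitoring.ExecutorAddedListener
class ExecutorAddedListener extends SparkListener {
  private val added = new AtomicInteger(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    val total = added.incrementAndGet()
    // Seeing this event shortly after SparkContext initialization is the signal that
    // executor resources did register; its absence can be used to decide to give up.
    println(s"Executor ${event.executorId} added on ${event.executorInfo.executorHost}; total so far = $total")
  }
}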



On Sat, Apr 10, 2021 at 11:50 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Ranju!
>
> > But if there are no extra resources available, then go for static
> allocation rather dynamic. Is it correct ?
>
> I think there is no such rule. If there is no more available new resource
> for Spark then the existing ones will be used (even the min executors is
> not guaranteed to be reached if no available resources).
>
> But I suggest to always set the max executors to a meaningful value (the
> default is too high: int max).
> This way you can avoid too high costs for a small/medium sized job where
> the tasks number is high but their size are small.
>
> Regarding your questions: in both cases as I see extra resources are
> helping and the jobs will be finished faster.
>
> Best Regards,
> Attila
>
>
> On Sat, Apr 10, 2021 at 7:01 PM ranju goel  wrote:
>
>> Hi Attila,
>>
>>
>> I understood what you mean that Use the extra resources if available for
>> running spark job, using schedulerbacklogtimeout (dynamic allocation).
>>
>> This will speeds up the job. But if there are no extra resources
>> available, then go for static allocation rather dynamic. Is it correct ?
>>
>>
>> Please validate below few scenarios for effective use of dynamic
>> allocation
>>
>>
>> 1.  Below screenshot shows, the Tasks are tiny, each task is executing
>> fast, but number of total tasks count is high (3241).
>>
>> *Dynamic Allocation Advantage for this scenario*
>>
>> If reserved spark quota has more resources available when min Executors
>> running, setting schedulerbacklogtimeout to few secs [say 15 min], those
>> available quota resources can be used and  (3241) number of tasks can be
>> finished fast. Is this understanding correct?
>>
>> [image: image.png]
>>
>>
>>
>> 2. Below report has less total number of tasks count (192) and parallel
>> running task count (24), but each task took around 7 min to complete.
>>
>> So here again, if resources are available in quota, more parallelism can
>> be achieved using schedulerbacklogtimeout (say 15 mins) and speeds up the
>> job.
>>
>>
>> [image: image.png]
>>
>> Best Regards
>>
>>
>>
>>
>>
>> *From:* Attila Zsolt Piros 
>> *Sent:* Friday, April 9, 2021 11:11 AM
>> *To:* Ranju Jain 
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>>
>>
>>
>> You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
>> high and the purpose of this config is very different form the one you
>> would like to use it for.
>>
>>
>> The confusion I guess comes from the fact that you are still thinking in
>> multiple Spark jobs.
>>
>>
>> *But Dynamic Allocation is useful in case of a single Spark job, too. *With
>> Dynamic allocation if there are pending tasks then new resources should be
>> allocated to speed up the calculation.
>> If you do not have enough partitions then you do not have enough tasks to
>> run in parallel that was my earlier comment about.
>>
>> So let's focus on your first job:
>> - With 3 executors it takes 2 hours to complete, right?
>> - And what about 8 executors?  I hope significantly less time.
>>
>> So if you have more than 3 

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-10 Thread Attila Zsolt Piros
Hi Ranju!

> But if there are no extra resources available, then go for static
allocation rather dynamic. Is it correct ?

I think there is no such rule. If there is no more new resource available
for Spark then the existing ones will be used (even the min executors count
is not guaranteed to be reached if no resources are available).

But I suggest always setting the max executors to a meaningful value (the
default is too high: int max).
This way you can avoid too high a cost for a small/medium sized job where
the task count is high but the tasks themselves are small.

Regarding your questions: in both cases, as I see it, the extra resources
help and the jobs will finish faster.

Best Regards,
Attila


On Sat, Apr 10, 2021 at 7:01 PM ranju goel  wrote:

> Hi Attila,
>
>
> I understood what you mean that Use the extra resources if available for
> running spark job, using schedulerbacklogtimeout (dynamic allocation).
>
> This will speeds up the job. But if there are no extra resources
> available, then go for static allocation rather dynamic. Is it correct ?
>
>
> Please validate below few scenarios for effective use of dynamic allocation
>
>
> 1.  Below screenshot shows, the Tasks are tiny, each task is executing
> fast, but number of total tasks count is high (3241).
>
> *Dynamic Allocation Advantage for this scenario*
>
> If reserved spark quota has more resources available when min Executors
> running, setting schedulerbacklogtimeout to few secs [say 15 min], those
> available quota resources can be used and  (3241) number of tasks can be
> finished fast. Is this understanding correct?
>
> [image: image.png]
>
>
>
> 2. Below report has less total number of tasks count (192) and parallel
> running task count (24), but each task took around 7 min to complete.
>
> So here again, if resources are available in quota, more parallelism can
> be achieved using schedulerbacklogtimeout (say 15 mins) and speeds up the
> job.
>
>
> [image: image.png]
>
> Best Regards
>
>
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Friday, April 9, 2021 11:11 AM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
> high and the purpose of this config is very different form the one you
> would like to use it for.
>
>
> The confusion I guess comes from the fact that you are still thinking in
> multiple Spark jobs.
>
>
> *But Dynamic Allocation is useful in case of a single Spark job, too. *With
> Dynamic allocation if there are pending tasks then new resources should be
> allocated to speed up the calculation.
> If you do not have enough partitions then you do not have enough tasks to
> run in parallel that was my earlier comment about.
>
> So let's focus on your first job:
> - With 3 executors it takes 2 hours to complete, right?
> - And what about 8 executors?  I hope significantly less time.
>
> So if you have more than 3 partitions and the tasks are meaningfully long
> enough to request some extra resources (schedulerBacklogTimeout) and the
> number of running executors are lower than the maximum number of executors
> you set (maxExecutors) then why wouldn't you want to use those extra
> resources?
>
>
>
>
>
>
> On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain  wrote:
>
> Hi Attila,
>
>
>
> Thanks for your reply.
>
>
>
> If I talk about single job which started to run with minExecutors as *3*.
> And Suppose this job [*which reads the full data from backend and process
> and writes it to a location*]
>
> takes around 2 hour to complete.
>
>
>
> What I understood is, as the default value of
> spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, so executors will
> scale from *3* to *4* and then *8* after every second if tasks are
> pending at scheduler backend. So If I don’t want  it 1 sec and I might set
> it to 1 hour [3600 sec] in 2 hour of spark job.
>
>
>
> So this is all about when I want to scale executors dynamically for spark
> job. Is that understanding correct?
>
>
>
> In the below statement I don’t understand much about available partitions
> :-(
>
> *pending tasks (which kinda related to the available partitions)*
>
>
>
>
>
> Regards
>
> Ranju
>
>
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Friday, April 9, 2021 12:13 AM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> Hi!
>
> For dynamic allocation you do not need to run the Spark jobs in parallel.
> Dyna

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-10 Thread ranju goel
Hi Attila,


I understood what you mean: use the extra resources, if available, for
running the Spark job, using schedulerBacklogTimeout (dynamic allocation).

This will speed up the job. But if there are no extra resources available,
then go for static allocation rather than dynamic. Is that correct?


Please validate the few scenarios below for effective use of dynamic allocation


1.  The screenshot below shows that the tasks are tiny and each task executes
fast, but the total task count is high (3241).

*Dynamic Allocation Advantage for this scenario*

If the reserved Spark quota has more resources available while the min
executors are running, then by setting schedulerBacklogTimeout to a suitable
value [say 15 min], those available quota resources can be used and the 3241
tasks can be finished fast. Is this understanding correct?

[image: image.png]



2. The report below has a lower total task count (192) and a parallel
running task count of 24, but each task took around 7 min to complete.

So here again, if resources are available in the quota, more parallelism can
be achieved using schedulerBacklogTimeout (say 15 mins), which speeds up the job.


[image: image.png]

Best Regards





*From:* Attila Zsolt Piros 
*Sent:* Friday, April 9, 2021 11:11 AM
*To:* Ranju Jain 
*Cc:* user@spark.apache.org
*Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes



You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
high and the purpose of this config is very different form the one you
would like to use it for.


The confusion I guess comes from the fact that you are still thinking in
multiple Spark jobs.


*But Dynamic Allocation is useful in case of a single Spark job, too. *With
Dynamic allocation if there are pending tasks then new resources should be
allocated to speed up the calculation.
If you do not have enough partitions then you do not have enough tasks to
run in parallel that was my earlier comment about.

So let's focus on your first job:
- With 3 executors it takes 2 hours to complete, right?
- And what about 8 executors?  I hope significantly less time.

So if you have more than 3 partitions and the tasks are meaningfully long
enough to request some extra resources (schedulerBacklogTimeout) and the
number of running executors are lower than the maximum number of executors
you set (maxExecutors) then why wouldn't you want to use those extra
resources?






On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain  wrote:

Hi Attila,



Thanks for your reply.



If I talk about single job which started to run with minExecutors as *3*.
And Suppose this job [*which reads the full data from backend and process
and writes it to a location*]

takes around 2 hour to complete.



What I understood is, as the default value of
spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, so executors will
scale from *3* to *4* and then *8* after every second if tasks are pending
at scheduler backend. So If I don’t want  it 1 sec and I might set it to 1
hour [3600 sec] in 2 hour of spark job.



So this is all about when I want to scale executors dynamically for spark
job. Is that understanding correct?



In the below statement I don’t understand much about available partitions
:-(

*pending tasks (which kinda related to the available partitions)*





Regards

Ranju





*From:* Attila Zsolt Piros 
*Sent:* Friday, April 9, 2021 12:13 AM
*To:* Ranju Jain 
*Cc:* user@spark.apache.org
*Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes



Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more
executors when there are pending tasks (which kinda related to the
available partitions) and scales down when the executor is idle (as within
one job the number of partitions can fluctuate).

But if you optimize for run time you can start those jobs in parallel at
the beginning.

In this case you will use higher number of executors even from the
beginning.

The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
schedule/synchronize different Spark jobs but it is about tasks.

Best regards,
Attila



On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
wrote:

Hi All,



I have set dynamic allocation enabled while running spark on Kubernetes .
But new executors are requested if pending tasks are backlogged for more
than configured duration in property
*“spark.dynamicAllocation.schedulerBacklogTimeout”*.



My Use Case is:



There are number of parallel jobs which might or might not run together at
a particular point of time. E.g Only One Spark Job may run at a point of
time or two spark jobs may run at a single point of time depending upon the
need.

I configured spark.dynamicAllocation.minExecutors as 3 and
spark.dynamicAllocation.maxExecutors as 8 .



Steps:

   1. SparkContext initialized with 3 executors and First Job requested.
   2. Now, if second job requested after few mins  (e.g 15 mins) , I am

Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Attila Zsolt Piros
You should not set "spark.dynamicAllocation.schedulerBacklogTimeout" so
high and the purpose of this config is very different form the one you
would like to use it for.

The confusion I guess comes from the fact that you are still thinking in
multiple Spark jobs.


*But Dynamic Allocation is useful in case of a single Spark job, too.*With
Dynamic allocation if there are pending tasks then new resources should be
allocated to speed up the calculation.
If you do not have enough partitions then you do not have enough tasks to
run in parallel that was my earlier comment about.

So let's focus on your first job:
- With 3 executors it takes 2 hours to complete, right?
- And what about 8 executors?  I hope significantly less time.

So if you have more than 3 partitions and the tasks are meaningfully long
enough to request some extra resources (schedulerBacklogTimeout) and the
number of running executors are lower than the maximum number of executors
you set (maxExecutors) then why wouldn't you want to use those extra
resources?
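
To make the knobs discussed in this thread concrete, a hedged Scala sketch of a dynamic-allocation setup for Spark on Kubernetes; the values are illustrative only, and shuffle tracking is assumed as the usual companion setting where no external shuffle service is available.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")  // no external shuffle service on K8s
  .config("spark.dynamicAllocation.minExecutors", "3")
  .config("spark.dynamicAllocation.maxExecutors", "8")                // a deliberate, meaningful upper bound
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Keep this small: it is "how long tasks may sit pending before more executors
  // are requested", not a mechanism for synchronizing separate jobs.
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  .getOrCreate()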



On Fri, Apr 9, 2021 at 6:03 AM Ranju Jain  wrote:

> Hi Attila,
>
>
>
> Thanks for your reply.
>
>
>
> If I talk about single job which started to run with minExecutors as *3*.
> And Suppose this job [*which reads the full data from backend and process
> and writes it to a location*]
>
> takes around 2 hour to complete.
>
>
>
> What I understood is, as the default value of
> spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, so executors will
> scale from *3* to *4* and then *8* after every second if tasks are
> pending at scheduler backend. So If I don’t want  it 1 sec and I might set
> it to 1 hour [3600 sec] in 2 hour of spark job.
>
>
>
> So this is all about when I want to scale executors dynamically for spark
> job. Is that understanding correct?
>
>
>
> In the below statement I don’t understand much about available partitions
> :-(
>
> *pending tasks (which kinda related to the available partitions)*
>
>
>
>
>
> Regards
>
> Ranju
>
>
>
>
>
> *From:* Attila Zsolt Piros 
> *Sent:* Friday, April 9, 2021 12:13 AM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Dynamic Allocation Backlog Property in Spark on Kubernetes
>
>
>
> Hi!
>
> For dynamic allocation you do not need to run the Spark jobs in parallel.
> Dynamic allocation simply means Spark scales up by requesting more
> executors when there are pending tasks (which kinda related to the
> available partitions) and scales down when the executor is idle (as within
> one job the number of partitions can fluctuate).
>
> But if you optimize for run time you can start those jobs in parallel at
> the beginning.
>
> In this case you will use higher number of executors even from the
> beginning.
>
> The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
> schedule/synchronize different Spark jobs but it is about tasks.
>
> Best regards,
> Attila
>
>
>
> On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
> wrote:
>
> Hi All,
>
>
>
> I have set dynamic allocation enabled while running spark on Kubernetes .
> But new executors are requested if pending tasks are backlogged for more
> than configured duration in property
> *“spark.dynamicAllocation.schedulerBacklogTimeout”*.
>
>
>
> My Use Case is:
>
>
>
> There are number of parallel jobs which might or might not run together at
> a particular point of time. E.g Only One Spark Job may run at a point of
> time or two spark jobs may run at a single point of time depending upon the
> need.
>
> I configured spark.dynamicAllocation.minExecutors as 3 and
> spark.dynamicAllocation.maxExecutors as 8 .
>
>
>
> Steps:
>
>1. SparkContext initialized with 3 executors and First Job requested.
>2. Now, if second job requested after few mins  (e.g 15 mins) , I am
>thinking if I can use the benefit of dynamic allocation and executor should
>scale up to handle second job tasks.
>
> For this I think *“spark.dynamicAllocation.schedulerBacklogTimeout”*
> needs to set after which new executors would be requested.
>
> *Problem: *Problem is there are chances that second job is not requested
> at all or may be requested after 10 mins or after 20 mins. How can I set a
> constant value for
>
> property *“spark.dynamicAllocation.schedulerBacklogTimeout” *to scale the
> executors , when tasks backlog is dependent upon the number of jobs
> requested.
>
>
>
> Regards
>
> Ranju
>
>


RE: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Ranju Jain
Hi Attila,

Thanks for your reply.

Say I have a single job which starts to run with minExecutors as 3, and suppose 
this job [which reads the full data from the backend, processes it and writes 
it to a location]
takes around 2 hours to complete.

What I understood is: as the default value of 
spark.dynamicAllocation.schedulerBacklogTimeout is 1 sec, executors will 
scale from 3 to 4 and then 8 every second if tasks are pending at the 
scheduler backend. So if I don't want it to be 1 sec, I might set it to 1 hour 
[3600 sec] for a 2-hour Spark job.

So this is all about when I want to scale executors dynamically for a Spark job. 
Is that understanding correct?

In the statement below I don't understand much about "available partitions" :-(
pending tasks (which kinda related to the available partitions)


Regards
Ranju


From: Attila Zsolt Piros 
Sent: Friday, April 9, 2021 12:13 AM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more executors 
when there are pending tasks (which kinda related to the available partitions) 
and scales down when the executor is idle (as within one job the number of 
partitions can fluctuate).

But if you optimize for run time you can start those jobs in parallel at the 
beginning.
In this case you will use higher number of executors even from the beginning.

The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to 
schedule/synchronize different Spark jobs but it is about tasks.

Best regards,
Attila

On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi All,

I have set dynamic allocation enabled while running spark on Kubernetes . But 
new executors are requested if pending tasks are backlogged for more than 
configured duration in property 
“spark.dynamicAllocation.schedulerBacklogTimeout”.

My Use Case is:

There are number of parallel jobs which might or might not run together at a 
particular point of time. E.g Only One Spark Job may run at a point of time or 
two spark jobs may run at a single point of time depending upon the need.
I configured spark.dynamicAllocation.minExecutors as 3 and 
spark.dynamicAllocation.maxExecutors as 8 .

Steps:

  1.  SparkContext initialized with 3 executors and First Job requested.
  2.  Now, if second job requested after few mins  (e.g 15 mins) , I am 
thinking if I can use the benefit of dynamic allocation and executor should 
scale up to handle second job tasks.

For this I think “spark.dynamicAllocation.schedulerBacklogTimeout” needs to set 
after which new executors would be requested.

Problem: Problem is there are chances that second job is not requested at all 
or may be requested after 10 mins or after 20 mins. How can I set a constant 
value for

property “spark.dynamicAllocation.schedulerBacklogTimeout” to scale the 
executors , when tasks backlog is dependent upon the number of jobs requested.


Regards
Ranju


Re: Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-08 Thread Attila Zsolt Piros
Hi!

For dynamic allocation you do not need to run the Spark jobs in parallel.
Dynamic allocation simply means Spark scales up by requesting more executors
when there are pending tasks (which is roughly related to the number of
available partitions) and scales down when executors are idle (since within
one job the number of partitions can fluctuate).

But if you optimize for run time, you can start those jobs in parallel at the
beginning.
In that case you will use a higher number of executors right from the start.

The "spark.dynamicAllocation.schedulerBacklogTimeout" is not for to
schedule/synchronize different Spark jobs but it is about tasks.
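
For reference, a minimal sketch of the settings discussed in this thread,
assuming Spark 3.0+ on Kubernetes; the master URL, container image, class and
jar below are placeholders, and on K8s the shuffle-tracking flag stands in for
the external shuffle service:

spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=3 \
  --conf spark.dynamicAllocation.maxExecutors=8 \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.kubernetes.container.image=my-spark:3.1.1 \
  --class com.example.MyJob \
  local:///opt/app/my-job.jar

The backlog timeout is counted from the moment tasks start queuing, and idle
executors are released again after executorIdleTimeout, so the min/max bounds
plus the defaults usually do the right thing without per-job tuning.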

Best regards,
Attila

On Tue, Apr 6, 2021 at 1:59 PM Ranju Jain 
wrote:

> Hi All,
>
>
>
> I have set dynamic allocation enabled while running spark on Kubernetes .
> But new executors are requested if pending tasks are backlogged for more
> than configured duration in property
> *“spark.dynamicAllocation.schedulerBacklogTimeout”*.
>
>
>
> My Use Case is:
>
>
>
> There are number of parallel jobs which might or might not run together at
> a particular point of time. E.g Only One Spark Job may run at a point of
> time or two spark jobs may run at a single point of time depending upon the
> need.
>
> I configured spark.dynamicAllocation.minExecutors as 3 and
> spark.dynamicAllocation.maxExecutors as 8 .
>
>
>
> Steps:
>
>1. SparkContext initialized with 3 executors and First Job requested.
>2. Now, if second job requested after few mins  (e.g 15 mins) , I am
>thinking if I can use the benefit of dynamic allocation and executor should
>scale up to handle second job tasks.
>
> For this I think *“spark.dynamicAllocation.schedulerBacklogTimeout”*
> needs to set after which new executors would be requested.
>
> *Problem: *Problem is there are chances that second job is not requested
> at all or may be requested after 10 mins or after 20 mins. How can I set a
> constant value for
>
> property *“spark.dynamicAllocation.schedulerBacklogTimeout” *to scale the
> executors , when tasks backlog is dependent upon the number of jobs
> requested.
>
>
>
> Regards
>
> Ranju
>


Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-06 Thread Ranju Jain
Hi All,

I have enabled dynamic allocation while running Spark on Kubernetes. New
executors are requested only if pending tasks have been backlogged for more
than the duration configured in the property
"spark.dynamicAllocation.schedulerBacklogTimeout".

My Use Case is:

There are a number of parallel jobs which may or may not run together at a
particular point in time. E.g. only one Spark job may run at a given time, or
two Spark jobs may run at the same time, depending on the need.
I configured spark.dynamicAllocation.minExecutors as 3 and 
spark.dynamicAllocation.maxExecutors as 8 .

Steps:

  1.  SparkContext initialized with 3 executors and First Job requested.
  2.  Now, if a second job is requested after a few minutes (e.g. 15 mins), I am
thinking I can use the benefit of dynamic allocation, and executors should
scale up to handle the second job's tasks.

For this I think "spark.dynamicAllocation.schedulerBacklogTimeout" needs to be
set, after which new executors would be requested.

Problem: There is a chance that the second job is not requested at all, or is
requested only after 10 or 20 minutes. How can I set a constant value for the
property "spark.dynamicAllocation.schedulerBacklogTimeout" to scale the
executors, when the task backlog depends on the number of jobs requested?


Regards
Ranju
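
For what it's worth, a small sketch of how the two backlog-related properties
read in the docs, in spark-defaults.conf style (the values shown are just the
defaults). The timer starts when tasks begin to queue, not when a job is
submitted, so it does not need to be tuned to whenever a second job happens to
arrive:

# Request more executors once tasks have been pending for this long
# (the clock starts when a backlog appears, whichever job caused it).
spark.dynamicAllocation.schedulerBacklogTimeout    1s

# While the backlog persists, keep requesting additional executors at this
# interval, up to spark.dynamicAllocation.maxExecutors.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout    1s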


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Ok!

Thanks for all guidance :-)

Regards
Ranju

From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 11:07 PM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

I don't have any specific reference. However, you can do a Google search.

best to ask the Unix team. They can do all that themselves.

HTH





LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:53, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Yes, there is a Team but I have not contacted them yet.
Trying to understand at my end.

I understood your point you mentioned below:

Do you have any reference or links where I can check out the Shared Volumes ?

Regards
Ranju

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:38 PM
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Well your mileage varies so to speak.


The only way to find out is setting an NFS mount and testing it.



The performance will depend on the mounted file system and the amount of cache 
it has.



File cache is important for reads and if you are going to do random writes (as 
opposed to sequential writes), then you can stripe the volume (RAID 1) for 
better performance.



Do you have a UNIX admin who can help you out as well?



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear for pros and cons of Shared Volume vs NFS for Read Write Many.
As NFS is Network File Server [remote] , so I can figure out that Shared Volume 
should be more preferable, but don’t know the other sides [drawback].

Regards
Ranju
From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all Executors pods data on some common location  which can be 
accessed and retrieved by driver pod.
I was first planning to go with NFS, but I think Shared Volume is equally good.
Please suggest Is there any major drawback in using Shared Volume instead of 
NFS when many pods are writing  on the same Volume [ReadWriteMany].

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
I don't have any specific reference. However, you can do a Google search.

best to ask the Unix team. They can do all that themselves.

HTH



LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 12:53, Ranju Jain  wrote:

> Yes, there is a Team but I have not contacted them yet.
>
> Trying to understand at my end.
>
>
>
> I understood your point you mentioned below:
>
>
>
> Do you have any reference or links where I can check out the Shared
> Volumes ?
>
>
>
> Regards
>
> Ranju
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:38 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Well your mileage varies so to speak.
>
>
>
> The only way to find out is setting an NFS mount and testing it.
>
>
>
> The performance will depend on the mounted file system and the amount of
> cache it has.
>
>
>
> File cache is important for reads and if you are going to do random writes
> (as opposed to sequential writes), then you can stripe the volume (RAID 1)
> for better performance.
>
>
>
> Do you have a UNIX admin who can help you out as well?
>
>
>
> HTH
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 12:01, Ranju Jain  wrote:
>
> Hi Mich,
>
>
>
> No, it is not Google cloud. It is simply Kubernetes deployed over Bare
> Metal Platform.
>
> I am not clear for pros and cons of Shared Volume vs NFS for Read Write
> Many.
>
> As NFS is Network File Server [remote] , so I can figure out that Shared
> Volume should be more preferable, but don’t know the other sides [drawback].
>
>
>
> Regards
>
> Ranju
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:22 PM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Ok this is on Google Cloud correct?
>
>
>
>
>
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
> wrote:
>
> Hi,
>
>
>
> I need to write all Executors pods data on some common location  which can
> be accessed and retrieved by driver pod.
>
> I was first planning to go with NFS, but I think Shared Volume is equally
> good.
>
> Please suggest Is there any major drawback in using Shared Volume instead
> of NFS when many pods are writing  on the same Volume [ReadWriteMany].
>
>
>
> Regards
>
> Ranju
>
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Yes, there is a team, but I have not contacted them yet.
I am trying to understand it at my end first.

I understood the point you mentioned below.

Do you have any references or links where I can read up on shared volumes?

Regards
Ranju

From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 5:38 PM
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Well your mileage varies so to speak.


The only way to find out is setting an NFS mount and testing it.



The performance will depend on the mounted file system and the amount of cache 
it has.



File cache is important for reads and if you are going to do random writes (as 
opposed to sequential writes), then you can stripe the volume (RAID 1) for 
better performance.



Do you have a UNIX admin who can help you out as well?



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear for pros and cons of Shared Volume vs NFS for Read Write Many.
As NFS is Network File Server [remote] , so I can figure out that Shared Volume 
should be more preferable, but don’t know the other sides [drawback].

Regards
Ranju
From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all Executors pods data on some common location  which can be 
accessed and retrieved by driver pod.
I was first planning to go with NFS, but I think Shared Volume is equally good.
Please suggest Is there any major drawback in using Shared Volume instead of 
NFS when many pods are writing  on the same Volume [ReadWriteMany].

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
Well your mileage varies so to speak.

The only way to find out is setting an NFS mount and testing it.


The performance will depend on the mounted file system and the amount of
cache it has.


File cache is important for reads and if you are going to do random writes
(as opposed to sequential writes), then you can stripe the volume (RAID 1)
for better performance.
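
For what it's worth, a rough sketch of such a test from one worker node; the
server name and export path below are invented, and the figures will depend
heavily on the network and on the NFS server's cache:

# Mount the export (server name and path are placeholders)
sudo mkdir -p /mnt/spark-shared
sudo mount -t nfs nfs-server.example.com:/export/spark /mnt/spark-shared

# Crude sequential write and read test (1 GiB)
dd if=/dev/zero of=/mnt/spark-shared/testfile bs=1M count=1024 conv=fdatasync
dd if=/mnt/spark-shared/testfile of=/dev/null bs=1M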


Do you have a UNIX admin who can help you out as well?


HTH


LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain  wrote:

> Hi Mich,
>
>
>
> No, it is not Google cloud. It is simply Kubernetes deployed over Bare
> Metal Platform.
>
> I am not clear for pros and cons of Shared Volume vs NFS for Read Write
> Many.
>
> As NFS is Network File Server [remote] , so I can figure out that Shared
> Volume should be more preferable, but don’t know the other sides [drawback].
>
>
>
> Regards
>
> Ranju
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:22 PM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Ok this is on Google Cloud correct?
>
>
>
>
>
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
> wrote:
>
> Hi,
>
>
>
> I need to write all Executors pods data on some common location  which can
> be accessed and retrieved by driver pod.
>
> I was first planning to go with NFS, but I think Shared Volume is equally
> good.
>
> Please suggest Is there any major drawback in using Shared Volume instead
> of NFS when many pods are writing  on the same Volume [ReadWriteMany].
>
>
>
> Regards
>
> Ranju
>
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Hi Mich,

No, it is not Google Cloud. It is simply Kubernetes deployed on a bare-metal
platform.
I am not clear on the pros and cons of a shared volume vs NFS for ReadWriteMany.
As NFS is a network file system [remote], I can see that a shared volume should
be preferable, but I don't know the other side [drawbacks].

Regards
Ranju
From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all Executors pods data on some common location  which can be 
accessed and retrieved by driver pod.
I was first planning to go with NFS, but I think Shared Volume is equally good.
Please suggest Is there any major drawback in using Shared Volume instead of 
NFS when many pods are writing  on the same Volume [ReadWriteMany].

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
Ok this is on Google Cloud correct?




LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
wrote:

> Hi,
>
>
>
> I need to write all Executors pods data on some common location  which can
> be accessed and retrieved by driver pod.
>
> I was first planning to go with NFS, but I think Shared Volume is equally
> good.
>
> Please suggest Is there any major drawback in using Shared Volume instead
> of NFS when many pods are writing  on the same Volume [ReadWriteMany].
>
>
>
> Regards
>
> Ranju
>


Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Hi,

I need to write all executor pods' data to some common location which can be
accessed and retrieved by the driver pod.
I was first planning to go with NFS, but I think a shared volume is equally good.
Please suggest: is there any major drawback in using a shared volume instead of
NFS when many pods are writing to the same volume [ReadWriteMany]?

Regards
Ranju
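
One way to do this with Spark on K8s is to mount a single ReadWriteMany
PersistentVolumeClaim into the driver and all executors through the
spark.kubernetes.*.volumes options. A hedged sketch: the volume name, claim
name and mount path below are invented, and the PVC has to come from a storage
class that actually supports RWX (NFS-backed provisioners are a common way to
get that):

--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.shared.options.claimName=spark-shared-pvc
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.shared.mount.path=/shared
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.options.claimName=spark-shared-pvc
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.mount.path=/shared

Executors can then write under /shared and the driver can read the results
back from the same path.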


Re: vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Sean Owen
You probably don't want swapping in any environment. Some tasks will grind
to a halt under mem pressure rather than just fail quickly. You would want
to simply provision more memory.
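
If you do decide to lower it on the worker nodes anyway, note that it is a
node-level sysctl rather than anything Spark or the pod spec controls; a
minimal sketch (the drop-in file name is arbitrary):

# Check the current value on a worker node
sysctl vm.swappiness

# Lower it for the running kernel
sudo sysctl -w vm.swappiness=1

# Persist it across reboots
echo 'vm.swappiness = 1' | sudo tee /etc/sysctl.d/90-swappiness.conf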

On Tue, Feb 16, 2021, 7:57 AM Jahar Tyagi  wrote:

> Hi,
>
> We have recently migrated from Spark 2.4.4 to Spark 3.0.1 and using Spark
> in virtual machine/bare metal as standalone deployment and as kubernetes
> deployment as well.
>
> There is a kernel parameter named as 'vm.swappiness' and we keep its value
> as '1' in standard deployment. Now since we are moving to kubernetes and on
> kubernetes worker nodes the value of this parameter is '60'.
>
> Now my question is if it is OK to keep such a high value of
> 'vm.swappiness'=60 in kubernetes environment for Spark workloads.
>
> Will such high value of this kernel parameter have performance impact on
> Spark PODs?
> As per below link from cloudera, they suggest not to set such a high
> value.
>
>
> https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-setting-vmswappiness-linux-kernel-parameter.html
>
> Any thoughts/suggestions on this are highly appreciated.
>
> Regards
> Jahar Tyagi
>
>


vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Jahar Tyagi
Hi,

We have recently migrated from Spark 2.4.4 to Spark 3.0.1 and using Spark
in virtual machine/bare metal as standalone deployment and as kubernetes
deployment as well.

There is a kernel parameter named 'vm.swappiness', and we keep its value
at '1' in our standard deployment. Now that we are moving to Kubernetes, we see
that on the Kubernetes worker nodes the value of this parameter is '60'.

My question is whether it is OK to keep such a high value of
'vm.swappiness'=60 in a Kubernetes environment for Spark workloads.

Will such a high value of this kernel parameter have a performance impact on
Spark pods?
As per the link below from Cloudera, they suggest not setting such a high value.

https://docs.cloudera.com/cloudera-manager/7.2.6/managing-clusters/topics/cm-setting-vmswappiness-linux-kernel-parameter.html

Any thoughts/suggestions on this are highly appreciated.

Regards
Jahar Tyagi


[Spark on Kubernetes] Spark Application dependency management Question.

2021-02-03 Thread xgong
Hey Team:

 Currently, we are upgrading the Spark version from 2.4 to 3.0, but we
found that applications which work with Spark 2.4 keep failing with
Spark 3.0. We are running Spark on Kubernetes in cluster mode.

  In spark-submit, we have "--jars local:///apps-dep/spark-extra-jars/*". It
works fine when we are using the Spark 2.4.5 image, but when we try to submit
the same application using the Spark 3.0 image, the driver always fails.
First, it complains "WARN DependencyUtils: Local jar
/apps-dep/spark-extra-jars/* does not exist, skipping.". Then the driver fails
with the exception "Exception in thread "main"
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID
6, 10.98.179.228, executor 1): java.nio.file.NoSuchFileException:
/apps-dep/spark-extra-jars/*". But I can confirm that all of our dependency
jars do exist in /apps-dep/spark-extra-jars in the docker image for both the
driver and the workers. And with Spark 2.4.5 it works fine.
  
   Could you give me a hint on how to debug it and what is going on here?
   
   Also, I do not understand the following behaviors:
   * If I change the --jars parameter value from "local:///" to "file:///",
it works with Spark 3.0.
   * If I use "--jars local:///apps-dep/spark-extra-jars/app.jar", the
submission fails with the exception "Exception in thread "main"
org.apache.spark.SparkException: Please specify
spark.kubernetes.file.upload.path property.", which makes sense according to
the Spark 3.0 doc. But if I use "--jars ///apps-dep/spark-extra-jars/*", the
submission and the application run successfully. Could you help me to
understand why it is fine to use "*" instead of a specific jar file?


Thank you very much

Xuan Gong
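
As a point of comparison, a hedged sketch of how the two URI schemes are
described for Spark 3 in cluster mode on K8s: local:// paths are expected to
already exist inside the container image, while file:// paths are uploaded
from the submitting machine and therefore need
spark.kubernetes.file.upload.path pointing at a Hadoop-compatible location
(the bucket below is made up). It does not explain the "*" behaviour described
above, so treat it only as a starting point:

# Jars already baked into the driver/executor image
spark-submit ... --jars local:///apps-dep/spark-extra-jars/app.jar ...

# Jars that live on the submitting machine
spark-submit ... \
  --conf spark.kubernetes.file.upload.path=s3a://my-bucket/spark-uploads \
  --jars file:///apps-dep/spark-extra-jars/app.jar ...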
  




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Everything is working fine now 
Thanks again

Loïc

From: German Schiavon 
Sent: Wednesday, December 16, 2020 19:23
To: Loic DESCOTTE 
Cc: user@spark.apache.org 
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

We all been there! no reason to be ashamed :)

On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Oh thank you you're right!! I feel shameful 


From: German Schiavon mailto:gschiavonsp...@gmail.com>>
Sent: Wednesday, December 16, 2020 18:01
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

  
data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:27
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:20
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
We all been there! no reason to be ashamed :)

On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE <
loic.desco...@kaizen-solutions.net> wrote:

> Oh thank you you're right!! I feel shameful 
>
> --
> *De :* German Schiavon 
> *Envoyé :* mercredi 16 décembre 2020 18:01
> *À :* Loic DESCOTTE 
> *Cc :* user@spark.apache.org 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Hi,
>
> seems that you have a typo no?
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
>
>   data.write.mode("overwrite").format("text").save("hfds://
> hdfs-namenode/user/loic/result.txt")
>
>
> On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> So I've tried several other things, including building a fat jar with hdfs
> dependency inside my app jar, and added this to the Spark configuration in
> the code :
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark 7")
>   .config("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.
> DistributedFileSystem].getName)
>   .getOrCreate()
>
>
> But still the same error...
>
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:27
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> I think it'll have to be part of the Spark distro, but I'm not 100% sure.
> I also think these get registered via manifest files in the JARs; if some
> process is stripping those when creating a bundled up JAR, could be it.
> Could be that it's failing to initialize too for some reason.
>
> On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> I've tried with this spark-submit option :
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But it didn't solve the issue.
> Should I add more jars?
>
> Thanks
> Loïc
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:20
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Seems like your Spark cluster doesn't somehow have the Hadoop JARs?
>
> On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> Hello,
>
> I am using Spark On Kubernetes and I have the following error when I try
> to write data on HDFS : "no filesystem for scheme hdfs"
>
> More details :
>
> I am submitting my application with Spark submit like this :
>
> spark-submit --master k8s://https://myK8SMaster:6443 \
> --deploy-mode cluster \
> --name hello-spark \
> --class Hello \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=gradiant/spark:2.4.4
> hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar
>
> Then the driver and the 2 executors are created in K8S.
>
> But it fails when I look at the logs of the driver, I see this :
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at Hello$.main(hello.scala:24)
> at Hello.main(hello.scala)
>
>
> As you can see , my application jar helloSpark.jar file is correctly
> loaded on HDFS by the Spark submit, but writing to HDFS fails.
>
> I have also tried to add the hadoop client and hdfs dependencies in the
> spark submit command:
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Oh thank you, you're right!! I feel shameful


From: German Schiavon 
Sent: Wednesday, December 16, 2020 18:01
To: Loic DESCOTTE 
Cc: user@spark.apache.org 
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

  
data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:27
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:20
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme:
hfds

  data.write.mode("overwrite").format("text").save("hfds://
hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE <
loic.desco...@kaizen-solutions.net> wrote:

> So I've tried several other things, including building a fat jar with hdfs
> dependency inside my app jar, and added this to the Spark configuration in
> the code :
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark 7")
>   .config("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.
> DistributedFileSystem].getName)
>   .getOrCreate()
>
>
> But still the same error...
>
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:27
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> I think it'll have to be part of the Spark distro, but I'm not 100% sure.
> I also think these get registered via manifest files in the JARs; if some
> process is stripping those when creating a bundled up JAR, could be it.
> Could be that it's failing to initialize too for some reason.
>
> On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> I've tried with this spark-submit option :
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But it didn't solve the issue.
> Should I add more jars?
>
> Thanks
> Loïc
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:20
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Seems like your Spark cluster doesn't somehow have the Hadoop JARs?
>
> On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> Hello,
>
> I am using Spark On Kubernetes and I have the following error when I try
> to write data on HDFS : "no filesystem for scheme hdfs"
>
> More details :
>
> I am submitting my application with Spark submit like this :
>
> spark-submit --master k8s://https://myK8SMaster:6443 \
> --deploy-mode cluster \
> --name hello-spark \
> --class Hello \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=gradiant/spark:2.4.4
> hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar
>
> Then the driver and the 2 executors are created in K8S.
>
> But it fails when I look at the logs of the driver, I see this :
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at Hello$.main(hello.scala:24)
> at Hello.main(hello.scala)
>
>
> As you can see , my application jar helloSpark.jar file is correctly
> loaded on HDFS by the Spark submit, but writing to HDFS fails.
>
> I have also tried to add the hadoop client and hdfs dependencies in the
> spark submit command:
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But the error is still here.
>
>
> Here is the Scala code of my application :
>
>
> import java.util.Calendar
>
> import org.apache.spark.sql.SparkSession
>
> case class Data(singleField: String)
>
> object Hello
> {
> def main(args: Array[String])
> {
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark")
>   .getOrCreate()
>
> import spark.implicits._
>
> val now = Calendar.getInstance().getTime().toString
> val data = List(Data(now)).toDF()
>
> data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
> }
> }
>
> Thanks for your help,
> Loïc
>
>


RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


From: Sean Owen 
Sent: Wednesday, December 16, 2020 14:27
To: Loic DESCOTTE 
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.
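
That manifest point can be checked directly: Hadoop FileSystem implementations
are registered through META-INF/services/org.apache.hadoop.fs.FileSystem inside
the jars, and shade/assembly plugins sometimes drop or overwrite that service
file instead of merging it. A quick sketch, using the jar name from this thread
as a stand-in:

# List the FileSystem implementations registered in the assembled jar
unzip -p helloSpark.jar META-INF/services/org.apache.hadoop.fs.FileSystem

# If the entry is missing, or does not mention
# org.apache.hadoop.hdfs.DistributedFileSystem, the assembly step is the likely
# culprit (service files need to be concatenated, not replaced).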

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

From: Sean Owen mailto:sro...@gmail.com>>
Sent: Wednesday, December 16, 2020 14:20
To: Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Subject: Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Hello,

I am using Spark on Kubernetes and I get the following error when I try to
write data to HDFS: "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark 
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Spark on Kubernetes

2020-11-13 Thread Arti Pande
Hi,

Is it recommended to use Spark on K8S in production?

Spark operator for Kubernetes seems to be in beta state.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator#:~:text=The%20Kubernetes%20Operator%20for%20Apache%20Spark%20aims%20to%20make%20specifying,surfacing%20status%20of%20Spark%20applications
.

Apparently there are open JIRA issues for spark that talk about problems on
K8s with dynamic allocation etc. Are there any workarounds? Are there any
other issues? What is the official recommendation?

Thanks & regards,
Arti Pande


Re: Hive on Spark in Kubernetes.

2020-10-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Thank you very much!


Sent from my iPhone

> On 7 Oct 2020, at 17:38, mykidong  wrote:
> 
> Hi all,
> 
> I have recently written a blog about hive on spark in kubernetes
> environment:
> - https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1
> 
> In this blog, you can find how to run hive on kubernetes using spark thrift
> server compatible with hive server2.
> 
> Cheers,
> 
> - Kidong.
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Hive on Spark in Kubernetes.

2020-10-07 Thread mykidong
Hi all,

I have recently written a blog about Hive on Spark in a Kubernetes
environment:
- https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1

In this blog, you can see how to run Hive on Kubernetes using the Spark Thrift
Server, which is compatible with HiveServer2.

Cheers,

- Kidong.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-12 Thread Prashant Sharma
Driver HA is not yet available in k8s mode. It could be a good area to
work on; I want to take a look at it. I personally refer to the official Spark
documentation for reference.
Thanks,



On Fri, Jul 10, 2020, 9:30 PM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Hi Prashant,
>
>
>
> It sounds encouraging. During scale down of the cluster, probably few of
> the spark jobs are impacted due to re-computation of shuffle data. This is
> not of supreme importance for us for now.
>
> Is there any reference deployment architecture available, which is HA ,
> scalable and dynamic-allocation-enabled for deploying Spark on K8s? Any
> suggested github repo or link?
>
>
>
> Thanks,
>
> Vaibhav V
>
>
>
>
>
> *From:* Prashant Sharma 
> *Sent:* Friday, July 10, 2020 12:57 AM
> *To:* user@spark.apache.org
> *Cc:* Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>; Varshney, Vaibhav (DI SW CAS MP AFC ARC) <
> vaibhav.varsh...@siemens.com>
> *Subject:* Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
>
>
> Hi,
>
>
>
> Whether it is a blocker or not, is upto you to decide. But, spark k8s
> cluster supports dynamic allocation, through a different mechanism, that
> is, without using an external shuffle service.
> https://issues.apache.org/jira/browse/SPARK-27963. There are pros and
> cons of both approaches. The only disadvantage of scaling without external
> shuffle service is, when the cluster scales down or it loses executors due
> to some external cause ( for example losing spot instances), we lose the
> shuffle data (data that was computed as an intermediate to some overall
> computation) on that executor. This situation may not lead to data loss, as
> spark can recompute the lost shuffle data.
>
>
>
> Dynamically, scaling up and down scaling, is helpful when the spark
> cluster is running off, "spot instances on AWS" for example or when the
> size of data is not known in advance. In other words, we cannot estimate
> how much resources would be needed to process the data. Dynamic scaling,
> lets the cluster increase its size only based on the number of pending
> tasks, currently this is the only metric implemented.
>
>
>
> I don't think it is a blocker for my production use cases.
>
>
>
> Thanks,
>
> Prashant
>
>
>
> On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
>
> Thanks for response. We have tried it in dev env. For production, if Spark
> 3.0 is not leveraging k8s scheduler, then would Spark Cluster in K8s be
> "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>
>


RE: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-10 Thread Varshney, Vaibhav
Hi Prashant,

It sounds encouraging. During scale-down of the cluster, a few of the Spark
jobs may be impacted by re-computation of shuffle data. This is not critical
for us for now.
Is there any reference deployment architecture available for deploying Spark
on K8s that is HA, scalable and dynamic-allocation-enabled? Any suggested
GitHub repo or link?

Thanks,
Vaibhav V


From: Prashant Sharma 
Sent: Friday, July 10, 2020 12:57 AM
To: user@spark.apache.org
Cc: Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) 
; Varshney, Vaibhav (DI SW CAS MP AFC ARC) 

Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

Hi,

Whether it is a blocker or not is up to you to decide. But a Spark K8s cluster
supports dynamic allocation through a different mechanism, that is, without
using an external shuffle service:
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons of
both approaches. The only disadvantage of scaling without an external shuffle
service is that when the cluster scales down, or loses executors due to some
external cause (for example losing spot instances), we lose the shuffle data
(intermediate data computed as part of some overall computation) on those
executors. This may not lead to data loss, as Spark can recompute the lost
shuffle data.

Dynamic scaling, both up and down, is helpful when the Spark cluster is
running on spot instances on AWS, for example, or when the size of the data is
not known in advance. In other words, we cannot estimate how many resources
would be needed to process the data. Dynamic scaling lets the cluster increase
its size based only on the number of pending tasks; currently this is the only
metric implemented.

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav
<vaibhav.varsh...@siemens.com> wrote:
Thanks for the response. We have tried it in a dev environment. For production,
if Spark 3.0 is not leveraging the K8s scheduler, would the Spark cluster in
K8s be "static"?
As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is still
a blocker for production workloads?

Thanks,
Vaibhav V

-Original Message-
From: Sean Owen <sro...@gmail.com>
Sent: Thursday, July 9, 2020 3:20 PM
To: Varshney, Vaibhav (DI SW CAS MP AFC ARC) <vaibhav.varsh...@siemens.com>
Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC)
<sai.ram...@siemens.com>
Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

I haven't used the K8S scheduler personally, but, just based on that comment I 
wouldn't worry too much. It's been around for several versions and AFAIK works 
fine in general. We sometimes aren't so great about removing "experimental" 
labels. That said I know there are still some things that could be added to it 
and more work going on, and maybe people closer to that work can comment. But 
yeah you shouldn't be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav 
mailto:vaibhav.varsh...@siemens.com>> wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Prashant Sharma
Hi,

Whether it is a blocker or not is up to you to decide. But a Spark K8s
cluster supports dynamic allocation through a different mechanism, that is,
without using an external shuffle service:
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons
of both approaches. The only disadvantage of scaling without an external
shuffle service is that when the cluster scales down, or loses executors due
to some external cause (for example losing spot instances), we lose the
shuffle data (intermediate data computed as part of some overall
computation) on those executors. This may not lead to data loss, as
Spark can recompute the lost shuffle data.

Dynamic scaling, both up and down, is helpful when the Spark cluster is
running on spot instances on AWS, for example, or when the size of the data
is not known in advance. In other words, we cannot estimate how many
resources would be needed to process the data. Dynamic scaling lets the
cluster increase its size based only on the number of pending tasks;
currently this is the only metric implemented.
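
For reference, here is a minimal sketch of the flags involved when submitting
against K8s with this mechanism enabled (assuming Spark 3.0+; the API-server
address, namespace, service account and image name are placeholders, the
min/max values are illustrative rather than recommendations, and the example
jar path depends on how your image was built):

  # Dynamic allocation on K8s without an external shuffle service (SPARK-27963).
  # Placeholders: <k8s-apiserver-host>, <port>, <your-spark-image>.
  spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<port> \
    --deploy-mode cluster \
    --name dynamic-allocation-demo \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=10 \
    local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar 1000

With spark.dynamicAllocation.shuffleTracking.enabled=true, the driver tracks
which executors still hold shuffle data needed by active jobs and avoids
removing them, which is what replaces the external shuffle service here.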

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Thanks for the response. We have tried it in a dev environment. For
> production, if Spark 3.0 is not leveraging the K8s scheduler, would the
> Spark cluster in K8s be "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still a blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>


RE: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Varshney, Vaibhav
Thanks for the response. We have tried it in a dev environment. For production,
if Spark 3.0 is not leveraging the K8s scheduler, would the Spark cluster in
K8s be "static"?
As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is still
a blocker for production workloads?

Thanks,
Vaibhav V

-Original Message-
From: Sean Owen  
Sent: Thursday, July 9, 2020 3:20 PM
To: Varshney, Vaibhav (DI SW CAS MP AFC ARC) 
Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) 

Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

I haven't used the K8S scheduler personally, but, just based on that comment I 
wouldn't worry too much. It's been around for several versions and AFAIK works 
fine in general. We sometimes aren't so great about removing "experimental" 
labels. That said I know there are still some things that could be added to it 
and more work going on, and maybe people closer to that work can comment. But 
yeah you shouldn't be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav  
wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Sean Owen
I haven't used the K8S scheduler personally, but, just based on that
comment I wouldn't worry too much. It's been around for several
versions and AFAIK works fine in general. We sometimes aren't so great
about removing "experimental" labels. That said I know there are still
some things that could be added to it and more work going on, and
maybe people closer to that work can comment. But yeah you shouldn't
be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav
 wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Varshney, Vaibhav
Hi Spark Experts,

We are trying to deploy Spark on Kubernetes.
As per the doc http://spark.apache.org/docs/latest/running-on-kubernetes.html,
it looks like K8s deployment is experimental:
"The Kubernetes scheduler is currently experimental".

Does Spark 3.0 not support production deployment using the K8s scheduler?
What's the plan for full support of the K8s scheduler?

Thanks,
Vaibhav V
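
For context, the basic cluster-mode submission described on that page looks
roughly like the following. This is only a sketch: the API-server address,
namespace, service account, image name and application jar are placeholders
that depend on your cluster and image.

  # A plain, statically-sized submission: 2 executors of 2g each.
  # The service account typically needs RBAC permission to create and delete
  # pods, since the driver itself requests executor pods from the API server.
  spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=2 \
    --conf spark.executor.memory=2g \
    --conf spark.kubernetes.namespace=spark \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    local:///path/to/your-application.jar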

