Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi,
In k8s the driver is responsible for executor creation. The likely cause of
your problem is insufficient memory allocated for executors in the K8s
cluster. Even with dynamic allocation, k8s won't schedule executor pods if
there is not enough free memory to fulfill their resource requests. A quick
way to confirm this is sketched below.
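
A rough way to confirm whether memory pressure is the blocker is to look at
the scheduler events, the node capacity and the driver log. This is only a
sketch; the namespace and pod name are illustrative and should be adapted to
your setup:

# Scheduling events typically show FailedScheduling / "Insufficient memory"
# messages when executor pods cannot be placed
kubectl -n default get events --sort-by=.lastTimestamp | grep -iE 'executor|schedul'

# Compare executor requests against what each node can actually offer
kubectl describe nodes | grep -A 5 "Allocated resources"

# The driver log reports the executor resource requests it is making
kubectl -n default logs <driver-pod-name> | grep -i executor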

My suggestions

   - Increase Executor Memory: Allocate more memory per executor (e.g., 2GB
   or 3GB) to allow for multiple executors within available cluster memory.
   - Adjust Driver Pod Resources: Ensure the driver pod has enough memory
   to run Spark and manage executors.
   - Optimize Resource Management: Explore on-demand allocation or adjust
   allocation granularity for better resource utilization. For example, look
   at the documentation for executor on-demand allocation
   (spark.executor.cores=0) and for spark.dynamicAllocation.minExecutors and
   spark.dynamicAllocation.maxExecutors; a sketch of these settings follows
   this list.
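
A minimal sketch of how these settings could be passed to spark-submit; the
figures are illustrative only and should be sized to your cluster, and
shuffle tracking is shown because dynamic allocation on k8s typically relies
on it in the absence of an external shuffle service:

spark-submit \
  --master k8s://https://<api-server>:<port> \
  --deploy-mode cluster \
  --conf spark.driver.memory=2g \
  --conf spark.executor.memory=2g \
  --conf spark.executor.memoryOverhead=512m \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=4 \
  ...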

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Tue, 30 Apr 2024 at 04:29, Tarun raghav 
wrote:

> Respected Sir/Madam,
> I am Tarunraghav. I have a query regarding spark on kubernetes.
>
> We have an eks cluster, within which we have spark installed in the pods.
> We set the executor memory as 1GB and set the executor instances as 2, I
> have also set dynamic allocation as true. So when I try to read a 3 GB CSV
> file or parquet file, it is supposed to increase the number of pods by 2.
> But the number of executor pods is zero.
> I don't know why executor pods aren't being created, even though I set
> executor instance as 2. Please suggest a solution for this.
>
> Thanks & Regards,
> Tarunraghav
>
>


Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Thanks for your kind words Sri

Well, it is true that, as yet, Spark on Kubernetes is not on par with Spark on
YARN in maturity; essentially, Spark on Kubernetes is still a work in
progress. So in the first place, IMO, one needs to think about why the
executors are failing. What causes this behaviour? Is it the code or some
inadequate set-up? These things come to my mind


   - Resource Allocation: Insufficient resources (CPU, memory) can lead to
   executor failures.
   - Mis-configuration Issues: Verify that the configurations are
   appropriate for your workload.
   - External Dependencies: If your Spark job relies on external services
   or data sources, ensure they are accessible. Issues such as network
   problems or unavailability of external services can lead to executor
   failures.
   - Data Skew: Uneven distribution of data across partitions can lead to
   data skew and cause some executors to process significantly more data than
   others. This can lead to resource exhaustion on specific executors.
   - Spark Version and Kubernetes Compatibility: Whether Spark is running on
   EKS or GKE, make sure you are using a Spark version that is compatible
   with your Kubernetes environment. These vendors normally run older, more
   stable versions of Spark, and compatibility issues can arise when using a
   newer version of Spark.
   - Docker Images: How up-to-date are your Docker images on container
   registries (ECR, GCR)? Is there any incompatibility between Docker images
   built on one Spark version and the host Spark version you are submitting
   your spark-submit from? A quick check is sketched after this list.
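
On the last point, a rough way to compare the two versions; the path inside
the container assumes the standard Apache Spark image layout under /opt/spark
and is only an illustration:

# Spark version on the host you run spark-submit from
spark-submit --version

# Spark version baked into the image you deploy
docker run --rm <registry>/<spark-image>:<tag> /opt/spark/bin/spark-submit --version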

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 19 Feb 2024 at 23:18, Sri Potluri  wrote:

> Dear Mich,
>
> Thank you for your detailed response and the suggested approach to
> handling retry logic. I appreciate you taking the time to outline the
> method of embedding custom retry mechanisms directly into the application
> code.
>
> While the solution of wrapping the main logic of the Spark job in a loop
> for controlling the number of retries is technically sound and offers a
> workaround, it may not be the most efficient or maintainable solution for
> organizations running a large number of Spark jobs. Modifying each
> application to include custom retry logic can be a significant undertaking,
> introducing variability in how retries are handled across different jobs,
> and require additional testing and maintenance.
>
> Ideally, operational concerns like retry behavior in response to
> infrastructure failures should be decoupled from the business logic of
> Spark applications. This separation allows data engineers and scientists to
> focus on the application logic without needing to implement and test
> infrastructure resilience mechanisms.
>
> Thank you again for your time and assistance.
>
> Best regards,
> Sri Potluri
>
> On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh 
> wrote:
>
>> Went through your issue with the code running on k8s
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> Well fault tolerance is built especially in k8s cluster. You can
>> implement your own logic to control the retry attempts. You can do this
>> by wrapping the main logic of your Spark job in a loop and controlling the
>> number of retries. If a persistent issue is detected, you can choose to
>> stop the job. Today is the third time that looping control has come up :)
>>
>> Take this code
>>
>> import time
>>
>> max_retries = 5
>> retries = 0
>> while retries < max_retries:
>>     try:
>>         # Your Spark job logic here
>>         pass
>>     except Exception as e:
>>         # Log the exception
>>         print(f"Exception in Spark job: {str(e)}")
>>         # Increment the retry count
>>         retries += 1
>>         # Sleep before the next attempt
>>         time.sleep(60)
>>     else:
>>         # Break out of the loop if the job completes successfully
>>         break
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>> 

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Cheng Pan
Spark has supported the window-based executor failure-tracking mechanism for
YARN for a long time; SPARK-41210 [1][2] (included in 3.5.0) extended this
feature to K8s.

[1] https://issues.apache.org/jira/browse/SPARK-41210
[2] https://github.com/apache/spark/pull/38732
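
A minimal sketch of how this can be enabled on 3.5.0+; confirm the exact
property names against the PR and the 3.5.0 configuration docs:

spark-submit \
  --master k8s://https://<api-server>:<port> \
  --deploy-mode cluster \
  --conf spark.executor.maxNumFailures=5 \
  --conf spark.executor.failuresValidityInterval=10m \
  ...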

Thanks,
Cheng Pan


> On Feb 19, 2024, at 23:59, Sri Potluri  wrote:
> 
> Hello Spark Community,
> 
> I am currently leveraging Spark on Kubernetes, managed by the Spark Operator, 
> for running various Spark applications. While the system generally works 
> well, I've encountered a challenge related to how Spark applications handle 
> executor failures, specifically in scenarios where executors enter an error 
> state due to persistent issues.
> 
> Problem Description
> 
> When an executor of a Spark application fails, the system attempts to 
> maintain the desired level of parallelism by automatically recreating a new 
> executor to replace the failed one. While this behavior is beneficial for 
> transient errors, ensuring that the application continues to run, it becomes 
> problematic in cases where the failure is due to a persistent issue (such as 
> misconfiguration, inaccessible external resources, or incompatible 
> environment settings). In such scenarios, the application enters a loop, 
> continuously trying to recreate executors, which leads to resource wastage 
> and complicates application management.
> 
> Desired Behavior
> 
> Ideally, I would like to have a mechanism to limit the number of retries for 
> executor recreation. If the system fails to successfully create an executor 
> more than a specified number of times (e.g., 5 attempts), the entire Spark 
> application should fail and stop trying to recreate the executor. This 
> behavior would help in efficiently managing resources and avoiding prolonged 
> failure states.
> 
> Questions for the Community
> 
> 1. Is there an existing configuration or method within Spark or the Spark 
> Operator to limit executor recreation attempts and fail the job after 
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or 
> solutions that could be applied in this context?
> 
> 
> Additional Context
> 
> I have explored Spark's task and stage retry configurations 
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these 
> do not directly address the issue of limiting executor creation retries. 
> Implementing a custom monitoring solution to track executor failures and 
> manually stop the application is a potential workaround, but it would be 
> preferable to have a more integrated solution.
> 
> I appreciate any guidance, insights, or feedback you can provide on this 
> matter.
> 
> Thank you for your time and support.
> 
> Best regards,
> Sri P


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Sri Potluri
Dear Mich,

Thank you for your detailed response and the suggested approach to handling
retry logic. I appreciate you taking the time to outline the method of
embedding custom retry mechanisms directly into the application code.

While the solution of wrapping the main logic of the Spark job in a loop
for controlling the number of retries is technically sound and offers a
workaround, it may not be the most efficient or maintainable solution for
organizations running a large number of Spark jobs. Modifying each
application to include custom retry logic can be a significant undertaking,
introducing variability in how retries are handled across different jobs,
and requiring additional testing and maintenance.

Ideally, operational concerns like retry behavior in response to
infrastructure failures should be decoupled from the business logic of
Spark applications. This separation allows data engineers and scientists to
focus on the application logic without needing to implement and test
infrastructure resilience mechanisms.

Thank you again for your time and assistance.

Best regards,
Sri Potluri

On Mon, Feb 19, 2024 at 5:03 PM Mich Talebzadeh 
wrote:

> Went through your issue with the code running on k8s
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it
> becomes problematic in cases where the failure is due to a persistent issue
> (such as misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> Well fault tolerance is built especially in k8s cluster. You can implement 
> your
> own logic to control the retry attempts. You can do this by wrapping the
> main logic of your Spark job in a loop and controlling the number of
> retries. If a persistent issue is detected, you can choose to stop the job.
> Today is the third time that looping control has come up :)
>
> Take this code
>
> import time
>
> max_retries = 5
> retries = 0
> while retries < max_retries:
>     try:
>         # Your Spark job logic here
>         pass
>     except Exception as e:
>         # Log the exception
>         print(f"Exception in Spark job: {str(e)}")
>         # Increment the retry count
>         retries += 1
>         # Sleep before the next attempt
>         time.sleep(60)
>     else:
>         # Break out of the loop if the job completes successfully
>         break
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh 
> wrote:
>
>> Not that I am aware of any configuration parameter in Spark classic to
>> limit executor creation. Because of fault tolerance Spark will try to
>> recreate failed executors. Not really that familiar with the Spark operator
>> for k8s. There may be something there.
>>
>> Have you considered custom monitoring and handling within Spark itself
>> using max_retries = 5  etc?
>>
>> HTH
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:
>>
>>> Hello Spark Community,
>>>
>>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>>> Operator, for running various Spark applications. While the system
>>> generally works well, I've encountered a challenge related to how Spark
>>> applications handle executor failures, specifically in scenarios where
>>> executors enter an error state due to persistent issues.
>>>
>>> *Problem Description*
>>>
>>> When an executor of a Spark application fails, the system attempts to
>>> maintain the desired level of parallelism by automatically recreating a new
>>> executor to replace the failed one. While this 

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
Went through your issue with the code running on k8s

When an executor of a Spark application fails, the system attempts to
maintain the desired level of parallelism by automatically recreating a new
executor to replace the failed one. While this behavior is beneficial for
transient errors, ensuring that the application continues to run, it
becomes problematic in cases where the failure is due to a persistent issue
(such as misconfiguration, inaccessible external resources, or incompatible
environment settings). In such scenarios, the application enters a loop,
continuously trying to recreate executors, which leads to resource wastage
and complicates application management.

Well, fault tolerance is built in, especially in a k8s cluster. You can
implement your own logic to control the retry attempts by wrapping the
main logic of your Spark job in a loop and controlling the number of
retries. If a persistent issue is detected, you can choose to stop the job.
Today is the third time that looping control has come up :)

Take this code

import time

max_retries = 5
retries = 0
while retries < max_retries:
    try:
        # Your Spark job logic here
        pass
    except Exception as e:
        # Log the exception
        print(f"Exception in Spark job: {str(e)}")
        # Increment the retry count
        retries += 1
        # Sleep before the next attempt
        time.sleep(60)
    else:
        # Break out of the loop if the job completes successfully
        break

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 19 Feb 2024 at 19:21, Mich Talebzadeh 
wrote:

> Not that I am aware of any configuration parameter in Spark classic to
> limit executor creation. Because of fault tolerance Spark will try to
> recreate failed executors. Not really that familiar with the Spark operator
> for k8s. There may be something there.
>
> Have you considered custom monitoring and handling within Spark itself
> using max_retries = 5  etc?
>
> HTH
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:
>
>> Hello Spark Community,
>>
>> I am currently leveraging Spark on Kubernetes, managed by the Spark
>> Operator, for running various Spark applications. While the system
>> generally works well, I've encountered a challenge related to how Spark
>> applications handle executor failures, specifically in scenarios where
>> executors enter an error state due to persistent issues.
>>
>> *Problem Description*
>>
>> When an executor of a Spark application fails, the system attempts to
>> maintain the desired level of parallelism by automatically recreating a new
>> executor to replace the failed one. While this behavior is beneficial for
>> transient errors, ensuring that the application continues to run, it
>> becomes problematic in cases where the failure is due to a persistent issue
>> (such as misconfiguration, inaccessible external resources, or incompatible
>> environment settings). In such scenarios, the application enters a loop,
>> continuously trying to recreate executors, which leads to resource wastage
>> and complicates application management.
>>
>> *Desired Behavior*
>>
>> Ideally, I would like to have a mechanism to limit the number of retries
>> for executor recreation. If the system fails to successfully create an
>> executor more than a specified number of times (e.g., 5 attempts), the
>> entire Spark application should fail and stop trying to recreate the
>> executor. This behavior would help in efficiently managing resources and
>> avoiding prolonged failure states.
>>
>> *Questions for the Community*
>>
>> 1. Is there an existing configuration or method within Spark or the Spark
>> Operator to limit executor recreation attempts and fail the job after
>> reaching a threshold?
>>
>> 2. Has anyone else encountered similar challenges and found workarounds
>> or solutions that could be applied in this context?
>>
>>
>> *Additional Context*
>>
>> I have explored Spark's task and 

Re: [Spark on Kubernetes]: Seeking Guidance on Handling Persistent Executor Failures

2024-02-19 Thread Mich Talebzadeh
I am not aware of any configuration parameter in Spark classic to limit
executor creation. Because of fault tolerance, Spark will try to recreate
failed executors. I am not really that familiar with the Spark operator for
k8s; there may be something there.

Have you considered custom monitoring and handling within Spark itself,
using max_retries = 5 etc.?

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun).


On Mon, 19 Feb 2024 at 18:34, Sri Potluri  wrote:

> Hello Spark Community,
>
> I am currently leveraging Spark on Kubernetes, managed by the Spark
> Operator, for running various Spark applications. While the system
> generally works well, I've encountered a challenge related to how Spark
> applications handle executor failures, specifically in scenarios where
> executors enter an error state due to persistent issues.
>
> *Problem Description*
>
> When an executor of a Spark application fails, the system attempts to
> maintain the desired level of parallelism by automatically recreating a new
> executor to replace the failed one. While this behavior is beneficial for
> transient errors, ensuring that the application continues to run, it
> becomes problematic in cases where the failure is due to a persistent issue
> (such as misconfiguration, inaccessible external resources, or incompatible
> environment settings). In such scenarios, the application enters a loop,
> continuously trying to recreate executors, which leads to resource wastage
> and complicates application management.
>
> *Desired Behavior*
>
> Ideally, I would like to have a mechanism to limit the number of retries
> for executor recreation. If the system fails to successfully create an
> executor more than a specified number of times (e.g., 5 attempts), the
> entire Spark application should fail and stop trying to recreate the
> executor. This behavior would help in efficiently managing resources and
> avoiding prolonged failure states.
>
> *Questions for the Community*
>
> 1. Is there an existing configuration or method within Spark or the Spark
> Operator to limit executor recreation attempts and fail the job after
> reaching a threshold?
>
> 2. Has anyone else encountered similar challenges and found workarounds or
> solutions that could be applied in this context?
>
>
> *Additional Context*
>
> I have explored Spark's task and stage retry configurations
> (`spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`), but these
> do not directly address the issue of limiting executor creation retries.
> Implementing a custom monitoring solution to track executor failures and
> manually stop the application is a potential workaround, but it would be
> preferable to have a more integrated solution.
>
> I appreciate any guidance, insights, or feedback you can provide on this
> matter.
>
> Thank you for your time and support.
>
> Best regards,
> Sri P
>


Re: spark on kubernetes

2022-10-16 Thread Qian Sun
Glad to hear it!

On Sun, Oct 16, 2022 at 2:37 PM Mohammad Abdollahzade Arani <
mamadazar...@gmail.com> wrote:

> Hi Qian,
> Thanks for the reply and I'm so sorry for the late reply.
> I found the answer. My mistake was the token conversion. I had to
> base64-decode the service account's token and certificate,
> and you are right, I have to use the `service account cert` to configure
> spark.kubernetes.authenticate.caCertFile.
> Thanks again. Best regards.
>
> On Sat, Oct 15, 2022 at 4:51 PM Qian Sun  wrote:
>
>> Hi Mohammad
>> Did you try this command?
>>
>>
>> ./bin/spark-submit \
>>   --master k8s://https://vm13:6443 \
>>   --class com.example.WordCounter \
>>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=default \
>>   --conf spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3 \
>>   --conf spark.kubernetes.namespace=default \
>>   java-word-count-1.0-SNAPSHOT.jar
>>
>> If you want spark.kubernetes.authenticate.caCertFile, you need to
>> configure it to serviceaccount certFile instead of apiserver certFile.
>>
>> On Sat, Oct 15, 2022 at 8:30 PM Mohammad Abdollahzade Arani
>> mamadazar...@gmail.com  wrote:
>>
>> I have a k8s cluster and a spark cluster.
>>>  my question is is as bellow:
>>>
>>>
>>> https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc
>>>
>>> I have searched and I found lot's of other similar questions on
>>> stackoverflow without an answer like  bellow:
>>>
>>>
>>> https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource
>>>
>>>
>>> --
>>> Best Wishes!
>>> Mohammad Abdollahzade Arani
>>> Computer Engineering @ SBU
>>>
>>> --
>> Best!
>> Qian Sun
>>
>
>
> --
> Best Wishes!
> Mohammad Abdollahzade Arani
> Computer Engineering @ SBU
>
>

-- 
Best!
Qian Sun


Re: spark on kubernetes

2022-10-15 Thread Qian Sun
Hi Mohammad
Did you try this command?


./bin/spark-submit  \ --master k8s://https://vm13:6443 \
--class com.example.WordCounter  \ --conf
spark.kubernetes.authenticate.driver.serviceAccountName=default  \
--conf 
spark.kubernetes.container.image=private-docker-registery/spark/spark:3.2.1-3
\ --conf spark.kubernetes.namespace=default \
java-word-count-1.0-SNAPSHOT.jar

If you want spark.kubernetes.authenticate.caCertFile, you need to configure
it to serviceaccount certFile instead of apiserver certFile.
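
A minimal sketch of pulling the service-account CA certificate and token out
of the cluster with kubectl; this assumes the default service account still
has an auto-created token secret (on Kubernetes 1.24+ such secrets are no
longer created automatically, so you may need `kubectl create token` or a
manually created secret instead), and the exact
spark.kubernetes.authenticate.* prefix depends on your deploy mode:

SA_SECRET=$(kubectl -n default get serviceaccount default -o jsonpath='{.secrets[0].name}')

# Secret data is base64-encoded, so decode it before handing it to Spark
kubectl -n default get secret "$SA_SECRET" -o jsonpath='{.data.ca\.crt}' | base64 -d > sa-ca.crt
kubectl -n default get secret "$SA_SECRET" -o jsonpath='{.data.token}'   | base64 -d > sa.token

# e.g. --conf spark.kubernetes.authenticate.caCertFile=sa-ca.crt \
#      --conf spark.kubernetes.authenticate.oauthTokenFile=sa.token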

On Sat, Oct 15, 2022 at 8:30 PM Mohammad Abdollahzade Arani
mamadazar...@gmail.com  wrote:

I have a k8s cluster and a spark cluster.
>  my question is as below:
>
>
> https://stackoverflow.com/questions/74053948/how-to-resolve-pods-is-forbidden-user-systemanonymous-cannot-watch-resourc
>
> I have searched and found lots of other similar questions on
> stackoverflow without an answer, like the one below:
>
>
> https://stackoverflow.com/questions/61982896/how-to-fix-pods-is-forbidden-user-systemanonymous-cannot-watch-resource
>
>
> --
> Best Wishes!
> Mohammad Abdollahzade Arani
> Computer Engineering @ SBU
>
> --
Best!
Qian Sun


Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Mich Talebzadeh
Splendid.

Please invite me to the next meeting

mich.talebza...@gmail.com

Timezone London, UK  *GMT+1*

Thanks,


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 8 Jul 2021 at 19:04, Holden Karau  wrote:

> Hi Y'all,
>
> We had an initial meeting which went well, got some more context around
> Volcano and its near-term roadmap. Talked about the impact around scheduler
> deadlocking and some ways that we could potentially improve integration
> from the Spark side and Volcano sides respectively. I'm going to start
> creating some sub-issues under
> https://issues.apache.org/jira/browse/SPARK-36057
>
> If anyone is interested in being on the next meeting please reach out and
> I'll send an e-mail around to try and schedule re-occurring sync that works
> for folks.
>
> Cheers,
>
> Holden
>
> On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:
>
>> That's awesome, I'm just starting to get context around Volcano but maybe
>> we can schedule an initial meeting for all of us interested in pursuing
>> this to get on the same page.
>>
>> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>>
>>> Hi team,
>>>
>>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>>> community also has such requirements :)
>>>
>>> Volcano provides several features for batch workload, e.g. fair-share,
>>> queue, reservation, preemption/reclaim and so on.
>>> It has been used in several product environments with Spark; if
>>> necessary, I can give an overall introduction about Volcano's features and
>>> those use cases :)
>>>
>>> -- Klaus
>>>
>>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>


 Please allow me to be diverse and express a different point of view on
 this roadmap.


 I believe from a technical point of view spending time and effort plus
 talent on batch scheduling on Kubernetes could be rewarding. However, if I
 may say I doubt whether such an approach and the so-called democratization
 of Spark on whatever platform is really should be of great focus.

 Having worked on Google Dataproc  (A 
 fully
 managed and highly scalable service for running Apache Spark, Hadoop and
 more recently other artefacts) for that past two years, and Spark on
 Kubernetes on-premise, I have come to the conclusion that Spark is not a
 beast that that one can fully commoditize it much like one can do with
 Zookeeper, Kafka etc. There is always a struggle to make some niche areas
 of Spark like Spark Structured Streaming (SSS) work seamlessly and
 effortlessly on these commercial platforms with whatever as a Service.


 Moreover, Spark (and I stand corrected) from the ground up has already
 a lot of resiliency and redundancy built in. It is truly an enterprise
 class product (requires enterprise class support) that will be difficult to
 commoditize with Kubernetes and expect the same performance. After all,
 Kubernetes is aimed at efficient resource sharing and potential cost saving
 for the mass market. In short I can see commercial enterprises will work on
 these platforms ,but may be the great talents on dev team should focus on
 stuff like the perceived limitation of SSS in dealing with chain of
 aggregation( if I am correct it is not yet supported on streaming datasets)


 These are my opinions and they are not facts, just opinions so to speak
 :)


view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Fri, 18 Jun 2021 at 23:18, Holden Karau 
 wrote:

> I think these approaches are good, but there are limitations (eg
> dynamic scaling) without us making changes inside of the Spark Kube
> scheduler.
>
> Certainly whichever scheduler extensions we add support for we should
> collaborate with the people developing those extensions insofar as they 
> are
> interested. My first place that I checked was #sig-scheduling which is
> fairly quite on the Kubernetes slack but if there are more places to look
> for folks interested in batch scheduling 

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
Hi Y'all,

We had an initial meeting which went well, got some more context around
Volcano and its near-term roadmap. Talked about the impact around scheduler
deadlocking and some ways that we could potentially improve integration
from the Spark side and Volcano sides respectively. I'm going to start
creating some sub-issues under
https://issues.apache.org/jira/browse/SPARK-36057
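
(For anyone following this thread later: the umbrella work under SPARK-36057
eventually shipped in Spark 3.3.0, where a custom scheduler such as Volcano
can be selected roughly as sketched below; treat the exact properties as
something to verify against the release documentation.)

spark-submit \
  --master k8s://https://<api-server>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.scheduler.name=volcano \
  --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  --conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep \
  ...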

If anyone is interested in being on the next meeting please reach out and
I'll send an e-mail around to try and schedule re-occurring sync that works
for folks.

Cheers,

Holden

On Thu, Jun 24, 2021 at 8:56 AM Holden Karau  wrote:

> That's awesome, I'm just starting to get context around Volcano but maybe
> we can schedule an initial meeting for all of us interested in pursuing
> this to get on the same page.
>
> On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:
>
>> Hi team,
>>
>> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
>> community also has such requirements :)
>>
>> Volcano provides several features for batch workload, e.g. fair-share,
>> queue, reservation, preemption/reclaim and so on.
>> It has been used in several product environments with Spark; if
>> necessary, I can give an overall introduction about Volcano's features and
>> those use cases :)
>>
>> -- Klaus
>>
>> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>>
>>> Please allow me to be diverse and express a different point of view on
>>> this roadmap.
>>>
>>>
>>> I believe from a technical point of view spending time and effort plus
>>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>>> may say I doubt whether such an approach and the so-called democratization
>>> of Spark on whatever platform is really should be of great focus.
>>>
>>> Having worked on Google Dataproc  (A 
>>> fully
>>> managed and highly scalable service for running Apache Spark, Hadoop and
>>> more recently other artefacts) for that past two years, and Spark on
>>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>>> beast that that one can fully commoditize it much like one can do with
>>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>>> effortlessly on these commercial platforms with whatever as a Service.
>>>
>>>
>>> Moreover, Spark (and I stand corrected) from the ground up has already a
>>> lot of resiliency and redundancy built in. It is truly an enterprise class
>>> product (requires enterprise class support) that will be difficult to
>>> commoditize with Kubernetes and expect the same performance. After all,
>>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>>> for the mass market. In short I can see commercial enterprises will work on
>>> these platforms ,but may be the great talents on dev team should focus on
>>> stuff like the perceived limitation of SSS in dealing with chain of
>>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>>
>>>
>>> These are my opinions and they are not facts, just opinions so to speak
>>> :)
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>>
 I think these approaches are good, but there are limitations (eg
 dynamic scaling) without us making changes inside of the Spark Kube
 scheduler.

 Certainly whichever scheduler extensions we add support for we should
 collaborate with the people developing those extensions insofar as they are
 interested. My first place that I checked was #sig-scheduling which is
 fairly quite on the Kubernetes slack but if there are more places to look
 for folks interested in batch scheduling on Kubernetes we should definitely
 give it a shot :)

 On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi,
>
> Regarding your point and I quote
>
> "..  I know that one of the Spark on Kube operators
> supports volcano/kube-batch so I was thinking that might be a place I 
> would
> start exploring..."
>
> There seems to be ongoing work on say Volcano as part of  Cloud
> Native Computing Foundation  (CNCF). For example
> through https://github.com/volcano-sh/volcano
>
 
>
> There may be value-add in collaborating with such groups 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Hi Holden,

Thank you for your points. I guess, coming from a corporate world, I had
overlooked how an open source project like Spark leverages resources and
interest :).

As @KlausMa kindly volunteered, it would be good to hear scheduling ideas on
Spark on Kubernetes, and of course, since I am sure you have some
inroads/ideas on this subject as well, I guess love would truly be in the air
for Kubernetes.

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 16:59, Holden Karau  wrote:

> Hi Mich,
>
> I certainly think making Spark on Kubernetes run well is going to be a
> challenge. However I think, and I could be wrong about this as well, that
> in terms of cluster managers Kubernetes is likely to be our future. Talking
> with people I don't hear about new standalone, YARN or mesos deployments of
> Spark, but I do hear about people trying to migrate to Kubernetes.
>
> To be clear I certainly agree that we need more work on structured
> streaming, but its important to remember that the Spark developers are not
> all fully interchangeable, we work on the things that we're interested in
> pursuing so even if structured streaming needs more love if I'm not super
> interested in structured streaming I'm less likely to work on it. That
> being said I am certainly spinning up a bit more in the Spark SQL area
> especially around our data source/connectors because I can see the need
> there too.
>
> On Wed, Jun 23, 2021 at 8:26 AM Mich Talebzadeh 
> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc  (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
That's awesome, I'm just starting to get context around Volcano but maybe
we can schedule an initial meeting for all of us interested in pursuing
this to get on the same page.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc  (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on say Volcano as part of  Cloud Native
 Computing Foundation  (CNCF). For example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on Integration of Spark with Volcano for Batch Scheduling.
 



 What is not very clear is the degree of progress of these projects. You
 may be kind enough to elaborate on KPI for each of these projects and where
 you think your contributions is going to be.


 HTH,


 Mich


view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Lalwani, Jayesh
You can always chain aggregations by chaining multiple Structured Streaming 
jobs. It’s not a showstopper.

Getting Spark on Kubernetes is important for organizations that want to pursue 
a multi-cloud strategy

From: Mich Talebzadeh 
Date: Wednesday, June 23, 2021 at 11:27 AM
To: "user @spark" 
Cc: dev 
Subject: RE: [EXTERNAL] Spark on Kubernetes scheduler variety






Please allow me to be diverse and express a different point of view on this 
roadmap.

I believe from a technical point of view spending time and effort plus talent 
on batch scheduling on Kubernetes could be rewarding. However, if I may say I 
doubt whether such an approach and the so-called democratization of Spark on 
whatever platform is really should be of great focus.
Having worked on Google Dataproc<https://cloud.google.com/dataproc> (A fully 
managed and highly scalable service for running Apache Spark, Hadoop and more 
recently other artefacts) for that past two years, and Spark on Kubernetes 
on-premise, I have come to the conclusion that Spark is not a beast that that 
one can fully commoditize it much like one can do with  Zookeeper, Kafka etc. 
There is always a struggle to make some niche areas of Spark like Spark 
Structured Streaming (SSS) work seamlessly and effortlessly on these commercial 
platforms with whatever as a Service.

Moreover, Spark (and I stand corrected) from the ground up has already a lot of 
resiliency and redundancy built in. It is truly an enterprise class product 
(requires enterprise class support) that will be difficult to commoditize with 
Kubernetes and expect the same performance. After all, Kubernetes is aimed at 
efficient resource sharing and potential cost saving for the mass market. In 
short I can see commercial enterprises will work on these platforms ,but may be 
the great talents on dev team should focus on stuff like the perceived 
limitation of SSS in dealing with chain of aggregation( if I am correct it is 
not yet supported on streaming datasets)

These are my opinions and they are not facts, just opinions so to speak :)

view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Jun 2021 at 23:18, Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
I think these approaches are good, but there are limitations (eg dynamic 
scaling) without us making changes inside of the Spark Kube scheduler.

Certainly whichever scheduler extensions we add support for we should 
collaborate with the people developing those extensions insofar as they are 
interested. My first place that I checked was #sig-scheduling which is fairly 
quite on the Kubernetes slack but if there are more places to look for folks 
interested in batch scheduling on Kubernetes we should definitely give it a 
shot :)

On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

Regarding your point and I quote

"..  I know that one of the Spark on Kube operators supports volcano/kube-batch 
so I was thinking that might be a place I would start exploring..."

There seems to be ongoing work on say Volcano as part of  Cloud Native 
Computing Foundation<https://cncf.io/> (CNCF). For example through 
https://github.com/volcano-sh/volcano

There may be value-add in collaborating with such groups through CNCF in order 
to have a collective approach to such work. There also seems to be some work on 
Integration of Spark with Volcano for Batch 
Scheduling.<https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/volcano-integration.md>



What is not very clear is the degree of progress of these projects. You may be 
kind enough to elaborate on KPI for each of these projects and where you think 
your contributions is going to be.



HTH,



Mich


view my Linkedin profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Fri, 18 Jun 2021 at 00:44, Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
Hi Folks,

I'm continuing my adventures to make Spark on containers party a

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread John Zhuge
Thanks Klaus! I am interested in more details.

On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc  (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short I can see commercial enterprises will work on
>> these platforms ,but may be the great talents on dev team should focus on
>> stuff like the perceived limitation of SSS in dealing with chain of
>> aggregation( if I am correct it is not yet supported on streaming datasets)
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quite on the Kubernetes slack but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on say Volcano as part of  Cloud Native
 Computing Foundation  (CNCF). For example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on Integration of Spark with Volcano for Batch Scheduling.
 



 What is not very clear is the degree of progress of these projects. You
 may be kind enough to elaborate on KPI for each of these projects and where
 you think your contributions is going to be.


 HTH,


 Mich


view my Linkedin profile
 



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is 

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Mich Talebzadeh
Thanks Klaus. That will be great.

It would also be helpful if you could elaborate on the need for this feature
in line with the limitations of the current batch workload.

Regards,

Mich



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 24 Jun 2021 at 02:53, Klaus Ma  wrote:

> Hi team,
>
> I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
> community also has such requirements :)
>
> Volcano provides several features for batch workload, e.g. fair-share,
> queue, reservation, preemption/reclaim and so on.
> It has been used in several product environments with Spark; if necessary,
> I can give an overall introduction about Volcano's features and those use
> cases :)
>
> -- Klaus
>
> On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Please allow me to be diverse and express a different point of view on
>> this roadmap.
>>
>>
>> I believe from a technical point of view spending time and effort plus
>> talent on batch scheduling on Kubernetes could be rewarding. However, if I
>> may say I doubt whether such an approach and the so-called democratization
>> of Spark on whatever platform is really should be of great focus.
>>
>> Having worked on Google Dataproc  (A fully
>> managed and highly scalable service for running Apache Spark, Hadoop and
>> more recently other artefacts) for that past two years, and Spark on
>> Kubernetes on-premise, I have come to the conclusion that Spark is not a
>> beast that that one can fully commoditize it much like one can do with
>> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
>> of Spark like Spark Structured Streaming (SSS) work seamlessly and
>> effortlessly on these commercial platforms with whatever as a Service.
>>
>>
>> Moreover, Spark (and I stand corrected) from the ground up has already a
>> lot of resiliency and redundancy built in. It is truly an enterprise class
>> product (requires enterprise class support) that will be difficult to
>> commoditize with Kubernetes and expect the same performance. After all,
>> Kubernetes is aimed at efficient resource sharing and potential cost saving
>> for the mass market. In short, I can see commercial enterprises working on
>> these platforms, but maybe the great talent on the dev team should focus on
>> areas such as the perceived limitation of SSS in dealing with chains of
>> aggregation (if I am correct, it is not yet supported on streaming datasets).
>>
>>
>> These are my opinions and they are not facts, just opinions so to speak :)
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>>
>>> I think these approaches are good, but there are limitations (eg dynamic
>>> scaling) without us making changes inside of the Spark Kube scheduler.
>>>
>>> Certainly whichever scheduler extensions we add support for we should
>>> collaborate with the people developing those extensions insofar as they are
>>> interested. My first place that I checked was #sig-scheduling which is
>>> fairly quiet on the Kubernetes slack, but if there are more places to look
>>> for folks interested in batch scheduling on Kubernetes we should definitely
>>> give it a shot :)
>>>
>>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi,

 Regarding your point and I quote

 "..  I know that one of the Spark on Kube operators
 supports volcano/kube-batch so I was thinking that might be a place I would
 start exploring..."

 There seems to be ongoing work on say Volcano as part of  Cloud Native
 Computing Foundation  (CNCF). For example through
 https://github.com/volcano-sh/volcano

>>> 

 There may be value-add in collaborating with such groups through CNCF
 in order to have a collective approach to such work. There also seems to be
 some work on Integration of Spark with Volcano for Batch Scheduling.
 



 What is not very 

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Klaus Ma
Hi team,

I'm kube-batch/Volcano founder, and I'm excited to hear that the spark
community also has such requirements :)

Volcano provides several features for batch workload, e.g. fair-share,
queue, reservation, preemption/reclaim and so on.
It has been used in several production environments with Spark; if necessary,
I can give an overall introduction about Volcano's features and those use
cases :)
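
For anyone who wants to experiment with this, a rough sketch of pointing the
driver and executor pods at a custom batch scheduler such as Volcano via pod
templates (assuming Spark 3.0+ for the pod template options; the image name,
file paths and scheduler choice are placeholders, and the template YAML itself
is assumed to set spec.schedulerName to "volcano"):

# Hypothetical sketch: K8s hands pods whose spec.schedulerName is "volcano"
# to the Volcano scheduler instead of the default kube-scheduler.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name volcano-batch-test \
  --conf spark.kubernetes.container.image=<spark-image> \
  --conf spark.kubernetes.driver.podTemplateFile=/path/to/driver-template.yaml \
  --conf spark.kubernetes.executor.podTemplateFile=/path/to/executor-template.yaml \
  local:///opt/spark/examples/jars/spark-examples.jar

Later Spark releases also expose a spark.kubernetes.scheduler.name option and
Volcano-specific settings, but the pod-template route above should work on any
3.x release that supports pod templates.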

-- Klaus

On Wed, Jun 23, 2021 at 11:26 PM Mich Talebzadeh 
wrote:

>
>
> Please allow me to be diverse and express a different point of view on
> this roadmap.
>
>
> I believe from a technical point of view spending time and effort plus
> talent on batch scheduling on Kubernetes could be rewarding. However, if I
> may say so, I doubt whether such an approach and the so-called democratization
> of Spark on whatever platform really should be the main focus.
>
> Having worked on Google Dataproc  (A fully
> managed and highly scalable service for running Apache Spark, Hadoop and
> more recently other artefacts) for the past two years, and Spark on
> Kubernetes on-premise, I have come to the conclusion that Spark is not a
> beast that one can fully commoditize in the way one can with
> Zookeeper, Kafka etc. There is always a struggle to make some niche areas
> of Spark like Spark Structured Streaming (SSS) work seamlessly and
> effortlessly on these commercial platforms with whatever as a Service.
>
>
> Moreover, Spark (and I stand corrected) from the ground up has already a
> lot of resiliency and redundancy built in. It is truly an enterprise class
> product (requires enterprise class support) that will be difficult to
> commoditize with Kubernetes and expect the same performance. After all,
> Kubernetes is aimed at efficient resource sharing and potential cost saving
> for the mass market. In short, I can see commercial enterprises working on
> these platforms, but maybe the great talent on the dev team should focus on
> areas such as the perceived limitation of SSS in dealing with chains of
> aggregation (if I am correct, it is not yet supported on streaming datasets).
>
>
> These are my opinions and they are not facts, just opinions so to speak :)
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:
>
>> I think these approaches are good, but there are limitations (eg dynamic
>> scaling) without us making changes inside of the Spark Kube scheduler.
>>
>> Certainly whichever scheduler extensions we add support for we should
>> collaborate with the people developing those extensions insofar as they are
>> interested. My first place that I checked was #sig-scheduling which is
>> fairly quiet on the Kubernetes slack, but if there are more places to look
>> for folks interested in batch scheduling on Kubernetes we should definitely
>> give it a shot :)
>>
>> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Regarding your point and I quote
>>>
>>> "..  I know that one of the Spark on Kube operators
>>> supports volcano/kube-batch so I was thinking that might be a place I would
>>> start exploring..."
>>>
>>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>>> Computing Foundation  (CNCF). For example through
>>> https://github.com/volcano-sh/volcano
>>>
>> 
>>>
>>> There may be value-add in collaborating with such groups through CNCF in
>>> order to have a collective approach to such work. There also seems to be
>>> some work on Integration of Spark with Volcano for Batch Scheduling.
>>> 
>>>
>>>
>>>
>>> What is not very clear is the degree of progress of these projects. You
>>> may be kind enough to elaborate on the KPIs for each of these projects and where
>>> you think your contribution is going to be.
>>>
>>>
>>> HTH,
>>>
>>>
>>> Mich
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 18 Jun 2021 at 00:44, Holden Karau  wrote:
>>>
 Hi Folks,

 I'm continuing 

Re: Spark on Kubernetes scheduler variety

2021-06-23 Thread Mich Talebzadeh
Please allow me to be diverse and express a different point of view on
this roadmap.


I believe from a technical point of view spending time and effort plus
talent on batch scheduling on Kubernetes could be rewarding. However, if I
may say so, I doubt whether such an approach and the so-called democratization
of Spark on whatever platform really should be the main focus.

Having worked on Google Dataproc  (A fully
managed and highly scalable service for running Apache Spark, Hadoop and
more recently other artefacts) for the past two years, and Spark on
Kubernetes on-premise, I have come to the conclusion that Spark is not a
beast that one can fully commoditize in the way one can with
Zookeeper, Kafka etc. There is always a struggle to make some niche areas
of Spark like Spark Structured Streaming (SSS) work seamlessly and
effortlessly on these commercial platforms with whatever as a Service.


Moreover, Spark (and I stand corrected) from the ground up has already a
lot of resiliency and redundancy built in. It is truly an enterprise class
product (requires enterprise class support) that will be difficult to
commoditize with Kubernetes and expect the same performance. After all,
Kubernetes is aimed at efficient resource sharing and potential cost saving
for the mass market. In short, I can see commercial enterprises working on
these platforms, but maybe the great talent on the dev team should focus on
areas such as the perceived limitation of SSS in dealing with chains of
aggregation (if I am correct, it is not yet supported on streaming datasets).


These are my opinions and they are not facts, just opinions so to speak :)


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 18 Jun 2021 at 23:18, Holden Karau  wrote:

> I think these approaches are good, but there are limitations (eg dynamic
> scaling) without us making changes inside of the Spark Kube scheduler.
>
> Certainly whichever scheduler extensions we add support for we should
> collaborate with the people developing those extensions insofar as they are
> interested. My first place that I checked was #sig-scheduling which is
> fairly quiet on the Kubernetes slack, but if there are more places to look
> for folks interested in batch scheduling on Kubernetes we should definitely
> give it a shot :)
>
> On Fri, Jun 18, 2021 at 1:41 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Regarding your point and I quote
>>
>> "..  I know that one of the Spark on Kube operators
>> supports volcano/kube-batch so I was thinking that might be a place I would
>> start exploring..."
>>
>> There seems to be ongoing work on say Volcano as part of  Cloud Native
>> Computing Foundation  (CNCF). For example through
>> https://github.com/volcano-sh/volcano
>>
> 
>>
>> There may be value-add in collaborating with such groups through CNCF in
>> order to have a collective approach to such work. There also seems to be
>> some work on Integration of Spark with Volcano for Batch Scheduling.
>> 
>>
>>
>>
>> What is not very clear is the degree of progress of these projects. You
>> may be kind enough to elaborate on the KPIs for each of these projects and where
>> you think your contribution is going to be.
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Fri, 18 Jun 2021 at 00:44, Holden Karau  wrote:
>>
>>> Hi Folks,
>>>
>>> I'm continuing my adventures to make Spark on containers party and I
>>> was wondering if folks have experience with the different batch
>>> scheduler options that they prefer? I was thinking that, so we can
>>> better support dynamic allocation, it might make sense for us to
>>> support using different schedulers, and I wanted to see if there are
>>> any that the community is more interested in?
>>>
>>> I know that one of the Spark on Kube operators supports
>>> volcano/kube-batch so I was thinking that might be a place I start
>>> exploring but also want to be open to other schedulers that folks
>>> might be interested in.
>>>
>>> Cheers,
>>>
>>> Holden :)

Re: [Spark in Kubernetes] Question about running in client mode

2021-04-27 Thread Shiqi Sun
Hi Attila,

Ah that makes sense. Thanks for the clarification!

Best,
Shiqi

On Mon, Apr 26, 2021 at 8:09 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Shiqi,
>
> In the case of client mode, the driver runs locally: on the same machine, even
> in the same process, as the spark-submit.
>
> So if the application was submitted from a running POD, then the driver will
> be running in a POD, and when submitted from outside of K8s it will be running
> outside.
> This is why there is no config mentioned for this.
>
> You can read about the deploy mode in general here:
> https://spark.apache.org/docs/latest/submitting-applications.html
>
> Best Regards,
> Attila
>
> On Tue, Apr 27, 2021 at 12:03 AM Shiqi Sun  wrote:
>
>> Hi Spark User group,
>>
>> I have a couple of quick questions about running Spark in Kubernetes
>> between different deploy modes.
>>
>> As specified in
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
>> since Spark 2.4, client mode support is available when running in
>> Kubernetes, and it says "when your application runs in client mode, the
>> driver can run inside a pod or on a physical host". Then here come the
>> questions.
>>
>> 1. If I understand correctly, in cluster mode, the driver is also running
>> inside a k8s pod. Then, what's the difference between running it in cluster
>> mode, versus running it in client mode when I choose to run my driver in a
>> pod?
>>
>> 2. What does it mean by "running driver on a physical host"? Does it mean
>> that it runs outside of the k8s cluster? What config should I pass to spark
>> submit so that it runs this way, instead of running my driver into a k8s
>> pod?
>>
>> Thanks!
>>
>> Best,
>> Shiqi
>>
>


Re: [Spark in Kubernetes] Question about running in client mode

2021-04-26 Thread Attila Zsolt Piros
Hi Shiqi,

In the case of client mode, the driver runs locally: on the same machine, even
in the same process, as the spark-submit.

So if the application was submitted from a running POD, then the driver will
be running in a POD, and when submitted from outside of K8s it will be running
outside.
This is why there is no config mentioned for this.

You can read about the deploy mode in general here:
https://spark.apache.org/docs/latest/submitting-applications.html
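
To make the difference concrete, a minimal sketch of the two submissions
(master URL, class, jar locations and the driver host are placeholders):

# Cluster mode: spark-submit asks K8s to create a driver pod, and that driver
# then requests the executor pods.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  local:///opt/app/my-app.jar

# Client mode: the driver is the spark-submit JVM itself, wherever it was
# started (inside a pod or on a host outside the cluster); only the executors
# are created as pods, and they must be able to reach the driver, e.g. via a
# headless service when the driver runs in a pod.
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode client \
  --conf spark.driver.host=<driver-hostname-or-service> \
  --class com.example.MyApp \
  /path/to/my-app.jar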

Best Regards,
Attila

On Tue, Apr 27, 2021 at 12:03 AM Shiqi Sun  wrote:

> Hi Spark User group,
>
> I have a couple of quick questions about running Spark in Kubernetes
> between different deploy modes.
>
> As specified in
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode,
> since Spark 2.4, client mode support is available when running in
> Kubernetes, and it says "when your application runs in client mode, the
> driver can run inside a pod or on a physical host". Then here come the
> questions.
>
> 1. If I understand correctly, in cluster mode, the driver is also running
> inside a k8s pod. Then, what's the difference between running it in cluster
> mode, versus running it in client mode when I choose to run my driver in a
> pod?
>
> 2. What does it mean by "running driver on a physical host"? Does it mean
> that it runs outside of the k8s cluster? What config should I pass to spark
> submit so that it runs this way, instead of running my driver into a k8s
> pod?
>
> Thanks!
>
> Best,
> Shiqi
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Ok!

Thanks for all guidance :-)

Regards
Ranju

From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 11:07 PM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

I don't have any specific reference. However, you can do a Google search.

best to ask the Unix team. They can do all that themselves.

HTH





LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:53, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Yes, there is a Team but I have not contacted them yet.
Trying to understand at my end.

I understood your point you mentioned below:

Do you have any reference or links where I can check out the Shared Volumes ?

Regards
Ranju

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:38 PM
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Well your mileage varies so to speak.


The only way to find out is setting an NFS mount and testing it.



The performance will depend on the mounted file system and the amount of cache 
it has.



File cache is important for reads and if you are going to do random writes (as 
opposed to sequential writes), then you can stripe the volume (RAID 0) for
better performance.



Do you have a UNIX admin who can help you out as well?



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear on the pros and cons of a Shared Volume vs NFS for ReadWriteMany.
As NFS is a Network File System [remote], I can figure that a Shared Volume
should be preferable, but I don't know the drawbacks.

Regards
Ranju
From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all executor pods' data to some common location which can be
accessed and retrieved by the driver pod.
I was first planning to go with NFS, but I think a Shared Volume is equally good.
Please suggest: is there any major drawback in using a Shared Volume instead of
NFS when many pods are writing to the same Volume [ReadWriteMany]?

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
I don't have any specific reference. However, you can do a Google search.

best to ask the Unix team. They can do all that themselves.

HTH



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 12:53, Ranju Jain  wrote:

> Yes, there is a Team but I have not contacted them yet.
>
> Trying to understand at my end.
>
>
>
> I understood your point you mentioned below:
>
>
>
> Do you have any reference or links where I can check out the Shared
> Volumes ?
>
>
>
> Regards
>
> Ranju
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:38 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Well your mileage varies so to speak.
>
>
>
> The only way to find out is setting an NFS mount and testing it.
>
>
>
> The performance will depend on the mounted file system and the amount of
> cache it has.
>
>
>
> File cache is important for reads and if you are going to do random writes
> (as opposed to sequential writes), then you can stripe the volume (RAID 0)
> for better performance.
>
>
>
> Do you have a UNIX admin who can help you out as well?
>
>
>
> HTH
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 12:01, Ranju Jain  wrote:
>
> Hi Mich,
>
>
>
> No, it is not Google cloud. It is simply Kubernetes deployed over Bare
> Metal Platform.
>
> I am not clear on the pros and cons of a Shared Volume vs NFS for
> ReadWriteMany.
>
> As NFS is a Network File System [remote], I can figure that a Shared
> Volume should be preferable, but I don't know the drawbacks.
>
>
>
> Regards
>
> Ranju
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:22 PM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Ok this is on Google Cloud correct?
>
>
>
>
>
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
> wrote:
>
> Hi,
>
>
>
> I need to write all executor pods' data to some common location which can
> be accessed and retrieved by the driver pod.
>
> I was first planning to go with NFS, but I think a Shared Volume is equally
> good.
>
> Please suggest: is there any major drawback in using a Shared Volume instead
> of NFS when many pods are writing to the same Volume [ReadWriteMany]?
>
>
>
> Regards
>
> Ranju
>
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Yes, there is a Team but I have not contacted them yet.
Trying to understand at my end.

I understood your point you mentioned below:

Do you have any reference or links where I can check out the Shared Volumes ?

Regards
Ranju

From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 5:38 PM
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Well your mileage varies so to speak.


The only way to find out is setting an NFS mount and testing it.



The performance will depend on the mounted file system and the amount of cache 
it has.



File cache is important for reads and if you are going to do random writes (as 
opposed to sequential writes), then you can stripe the volume (RAID 0) for
better performance.



Do you have a UNIX admin who can help you out as well?



HTH



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain 
mailto:ranju.j...@ericsson.com>> wrote:
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear on the pros and cons of a Shared Volume vs NFS for ReadWriteMany.
As NFS is a Network File System [remote], I can figure that a Shared Volume
should be preferable, but I don't know the drawbacks.

Regards
Ranju
From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all executor pods' data to some common location which can be
accessed and retrieved by the driver pod.
I was first planning to go with NFS, but I think a Shared Volume is equally good.
Please suggest: is there any major drawback in using a Shared Volume instead of
NFS when many pods are writing to the same Volume [ReadWriteMany]?

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
Well your mileage varies so to speak.

The only way to find out is setting an NFS mount and testing it.


The performance will depend on the mounted file system and the amount of
cache it has.


File cache is important for reads and if you are going to do random writes
(as opposed to sequential writes), then you can stripe the volume (RAID 0)
for better performance.


Do you have a UNIX admin who can help you out as well?
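

As a concrete illustration of the "common location" being discussed, a
ReadWriteMany PersistentVolumeClaim (which could itself be backed by NFS or
another shared filesystem) can be mounted into both the driver and the
executors directly from spark-submit. A rough sketch, assuming an existing PVC
named shared-data whose storage class supports ReadWriteMany (all names and
paths are placeholders):

spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.shared.options.claimName=shared-data \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.shared.mount.path=/shared \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.options.claimName=shared-data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.shared.mount.path=/shared \
  --class com.example.MyApp \
  local:///opt/app/my-app.jar

# Executors can then write under /shared and the driver can read the same
# path back, which is the ReadWriteMany pattern in question.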


HTH


LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 12:01, Ranju Jain  wrote:

> Hi Mich,
>
>
>
> No, it is not Google cloud. It is simply Kubernetes deployed over Bare
> Metal Platform.
>
> I am not clear on the pros and cons of a Shared Volume vs NFS for
> ReadWriteMany.
>
> As NFS is a Network File System [remote], I can figure that a Shared
> Volume should be preferable, but I don't know the drawbacks.
>
>
>
> Regards
>
> Ranju
>
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, March 11, 2021 5:22 PM
> *To:* Ranju Jain 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS
>
>
>
> Ok this is on Google Cloud correct?
>
>
>
>
>
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
> wrote:
>
> Hi,
>
>
>
> I need to write all executor pods' data to some common location which can
> be accessed and retrieved by the driver pod.
>
> I was first planning to go with NFS, but I think a Shared Volume is equally
> good.
>
> Please suggest: is there any major drawback in using a Shared Volume instead
> of NFS when many pods are writing to the same Volume [ReadWriteMany]?
>
>
>
> Regards
>
> Ranju
>
>


RE: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Ranju Jain
Hi Mich,

No, it is not Google cloud. It is simply Kubernetes deployed over Bare Metal 
Platform.
I am not clear on the pros and cons of a Shared Volume vs NFS for ReadWriteMany.
As NFS is a Network File System [remote], I can figure that a Shared Volume
should be preferable, but I don't know the drawbacks.

Regards
Ranju
From: Mich Talebzadeh 
Sent: Thursday, March 11, 2021 5:22 PM
To: Ranju Jain 
Cc: user@spark.apache.org
Subject: Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

Ok this is on Google Cloud correct?







LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
mailto:ranju.j...@ericsson.com.invalid>> wrote:
Hi,

I need to write all executor pods' data to some common location which can be
accessed and retrieved by the driver pod.
I was first planning to go with NFS, but I think a Shared Volume is equally good.
Please suggest: is there any major drawback in using a Shared Volume instead of
NFS when many pods are writing to the same Volume [ReadWriteMany]?

Regards
Ranju


Re: Spark on Kubernetes | 3.0.1 | Shared Volume or NFS

2021-03-11 Thread Mich Talebzadeh
Ok this is on Google Cloud correct?




LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 11 Mar 2021 at 11:29, Ranju Jain 
wrote:

> Hi,
>
>
>
> I need to write all executor pods' data to some common location which can
> be accessed and retrieved by the driver pod.
>
> I was first planning to go with NFS, but I think a Shared Volume is equally
> good.
>
> Please suggest: is there any major drawback in using a Shared Volume instead
> of NFS when many pods are writing to the same Volume [ReadWriteMany]?
>
>
>
> Regards
>
> Ranju
>


RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Everything is working fine now 
Thanks again

Loïc

De : German Schiavon 
Envoyé : mercredi 16 décembre 2020 19:23
À : Loic DESCOTTE 
Cc : user@spark.apache.org 
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

We've all been there! No reason to be ashamed :)

On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Oh thank you you're right!! I feel shameful 


De : German Schiavon mailto:gschiavonsp...@gmail.com>>
Envoyé : mercredi 16 décembre 2020 18:01
À : Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Cc : user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

  
data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


De : Sean Owen mailto:sro...@gmail.com>>
Envoyé : mercredi 16 décembre 2020 14:27
À : Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

De : Sean Owen mailto:sro...@gmail.com>>
Envoyé : mercredi 16 décembre 2020 14:20
À : Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
We've all been there! No reason to be ashamed :)

On Wed, 16 Dec 2020 at 18:14, Loic DESCOTTE <
loic.desco...@kaizen-solutions.net> wrote:

> Oh thank you you're right!! I feel shameful 
>
> --
> *De :* German Schiavon 
> *Envoyé :* mercredi 16 décembre 2020 18:01
> *À :* Loic DESCOTTE 
> *Cc :* user@spark.apache.org 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Hi,
>
> seems that you have a typo no?
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
>
>   data.write.mode("overwrite").format("text").save("hfds://
> hdfs-namenode/user/loic/result.txt")
>
>
> On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> So I've tried several other things, including building a fat jar with hdfs
> dependency inside my app jar, and added this to the Spark configuration in
> the code :
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark 7")
>   .config("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.
> DistributedFileSystem].getName)
>   .getOrCreate()
>
>
> But still the same error...
>
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:27
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> I think it'll have to be part of the Spark distro, but I'm not 100% sure.
> I also think these get registered via manifest files in the JARs; if some
> process is stripping those when creating a bundled up JAR, could be it.
> Could be that it's failing to initialize too for some reason.
>
> On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> I've tried with this spark-submit option :
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But it didn't solve the issue.
> Should I add more jars?
>
> Thanks
> Loïc
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:20
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Seems like your Spark cluster doesn't somehow have the Hadoop JARs?
>
> On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> Hello,
>
> I am using Spark On Kubernetes and I have the following error when I try
> to write data on HDFS : "no filesystem for scheme hdfs"
>
> More details :
>
> I am submitting my application with Spark submit like this :
>
> spark-submit --master k8s://https://myK8SMaster:6443 \
> --deploy-mode cluster \
> --name hello-spark \
> --class Hello \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=gradiant/spark:2.4.4
> hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar
>
> Then the driver and the 2 executors are created in K8S.
>
> But it fails when I look at the logs of the driver, I see this :
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at Hello$.main(hello.scala:24)
> at Hello.main(hello.scala)
>
>
> As you can see , my application jar helloSpark.jar file is correctly
> loaded on HDFS by the Spark submit, but writing to HDFS fails.
>
> I have also tried to add the hadoop client and hdfs dependencies in the
> spark submit command:
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Oh thank you you're right!! I feel shameful


De : German Schiavon 
Envoyé : mercredi 16 décembre 2020 18:01
À : Loic DESCOTTE 
Cc : user@spark.apache.org 
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds

  
data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


De : Sean Owen mailto:sro...@gmail.com>>
Envoyé : mercredi 16 décembre 2020 14:27
À : Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

De : Sean Owen mailto:sro...@gmail.com>>
Envoyé : mercredi 16 décembre 2020 14:20
À : Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
Hi,

seems that you have a typo no?

Exception in thread "main" java.io.IOException: No FileSystem for scheme:
hfds

  data.write.mode("overwrite").format("text").save("hfds://
hdfs-namenode/user/loic/result.txt")
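
A quick way to confirm that only the URI scheme is at fault (and that the
image can otherwise reach HDFS) is to try the same listing from inside the
driver pod, assuming the hadoop CLI is present in the image; pod and namespace
names are placeholders:

# The misspelled scheme fails with the same "No FileSystem for scheme" error
# outside Spark, while the correct one lists the directory.
kubectl exec -n <namespace> <driver-pod> -- hadoop fs -ls hfds://hdfs-namenode/user/loic/
kubectl exec -n <namespace> <driver-pod> -- hadoop fs -ls hdfs://hdfs-namenode/user/loic/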


On Wed, 16 Dec 2020 at 17:02, Loic DESCOTTE <
loic.desco...@kaizen-solutions.net> wrote:

> So I've tried several other things, including building a fat jar with hdfs
> dependency inside my app jar, and added this to the Spark configuration in
> the code :
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark 7")
>   .config("fs.hdfs.impl", classOf[org.apache.hadoop.hdfs.
> DistributedFileSystem].getName)
>   .getOrCreate()
>
>
> But still the same error...
>
> ------
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:27
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> I think it'll have to be part of the Spark distro, but I'm not 100% sure.
> I also think these get registered via manifest files in the JARs; if some
> process is stripping those when creating a bundled up JAR, could be it.
> Could be that it's failing to initialize too for some reason.
>
> On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> I've tried with this spark-submit option :
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But it didn't solve the issue.
> Should I add more jars?
>
> Thanks
> Loïc
> --
> *De :* Sean Owen 
> *Envoyé :* mercredi 16 décembre 2020 14:20
> *À :* Loic DESCOTTE 
> *Objet :* Re: Spark on Kubernetes : unable to write files to HDFS
>
> Seems like your Spark cluster doesn't somehow have the Hadoop JARs?
>
> On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE <
> loic.desco...@kaizen-solutions.net> wrote:
>
> Hello,
>
> I am using Spark On Kubernetes and I have the following error when I try
> to write data on HDFS : "no filesystem for scheme hdfs"
>
> More details :
>
> I am submitting my application with Spark submit like this :
>
> spark-submit --master k8s://https://myK8SMaster:6443 \
> --deploy-mode cluster \
> --name hello-spark \
> --class Hello \
> --conf spark.executor.instances=2 \
> --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.container.image=gradiant/spark:2.4.4
> hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar
>
> Then the driver and the 2 executors are created in K8S.
>
> But it fails when I look at the logs of the driver, I see this :
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> hfds
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
> at
> org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
> at
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at Hello$.main(hello.scala:24)
> at Hello.main(hello.scala)
>
>
> As you can see , my application jar helloSpark.jar file is correctly
> loaded on HDFS by the Spark submit, but writing to HDFS fails.
>
> I have also tried to add the hadoop client and hdfs dependencies in the
> spark submit command:
>
> --packages
> org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \
>
> But the error is still here.
>
>
> Here is the Scala code of my application :
>
>
> import java.util.Calendar
>
> import org.apache.spark.sql.SparkSession
>
> case class Data(singleField: String)
>
> object Hello
> {
> def main(args: Array[String])
> {
>
> val spark = SparkSession
>   .builder()
>   .appName("Hello Spark")
>   .getOrCreate()
>
> import spark.implicits._
>
> val now = Calendar.getInstance().getTime().toString
> val data = List(Data(now)).toDF()
>
> data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
> }
> }
>
> Thanks for your help,
> Loïc
>
>


RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
So I've tried several other things, including building a fat jar with hdfs 
dependency inside my app jar, and added this to the Spark configuration in the 
code :

val spark = SparkSession
  .builder()
  .appName("Hello Spark 7")
  .config("fs.hdfs.impl", 
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  .getOrCreate()


But still the same error...


De : Sean Owen 
Envoyé : mercredi 16 décembre 2020 14:27
À : Loic DESCOTTE 
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

I think it'll have to be part of the Spark distro, but I'm not 100% sure. I 
also think these get registered via manifest files in the JARs; if some process 
is stripping those when creating a bundled up JAR, could be it. Could be that 
it's failing to initialize too for some reason.
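
One cheap way to test that theory is to check whether the bundled JAR still
carries Hadoop's FileSystem service registrations, which is where schemes such
as hdfs:// are resolved from (a sketch; the jar name is taken from this
thread):

# If the assembly/shade step dropped or failed to merge this service file,
# Hadoop cannot resolve hdfs:// even though the classes are on the classpath.
unzip -p helloSpark.jar META-INF/services/org.apache.hadoop.fs.FileSystem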

On Wed, Dec 16, 2020 at 7:24 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
I've tried with this spark-submit option :

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But it didn't solve the issue.
Should I add more jars?

Thanks
Loïc

De : Sean Owen mailto:sro...@gmail.com>>
Envoyé : mercredi 16 décembre 2020 14:20
À : Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>>
Objet : Re: Spark on Kubernetes : unable to write files to HDFS

Seems like your Spark cluster doesn't somehow have the Hadoop JARs?

On Wed, Dec 16, 2020 at 6:45 AM Loic DESCOTTE 
mailto:loic.desco...@kaizen-solutions.net>> 
wrote:
Hello,

I am using Spark On Kubernetes and I have the following error when I try to 
write data on HDFS : "no filesystem for scheme hdfs"

More details :

I am submitting my application with Spark submit like this :

spark-submit --master k8s://https://myK8SMaster:6443 \
--deploy-mode cluster \
--name hello-spark \
--class Hello \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gradiant/spark:2.4.4 
hdfs://hdfs-namenode/user/loic/jars/helloSpark.jar

Then the driver and the 2 executors are created in K8S.

But it fails when I look at the logs of the driver, I see this :

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at 
org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at Hello$.main(hello.scala:24)
at Hello.main(hello.scala)


As you can see , my application jar helloSpark.jar file is correctly loaded on 
HDFS by the Spark submit, but writing to HDFS fails.

I have also tried to add the hadoop client and hdfs dependencies in the spark
submit command:

--packages 
org.apache.hadoop:hadoop-client:2.6.5,org.apache.hadoop:hadoop-hdfs:2.6.5 \

But the error is still here.


Here is the Scala code of my application :


import java.util.Calendar

import org.apache.spark.sql.SparkSession

case class Data(singleField: String)

object Hello
{
def main(args: Array[String])
{

val spark = SparkSession
  .builder()
  .appName("Hello Spark")
  .getOrCreate()

import spark.implicits._

val now = Calendar.getInstance().getTime().toString
val data = List(Data(now)).toDF()

data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt")
}
}

Thanks for your help,
Loïc


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-12 Thread Prashant Sharma
Driver HA is not yet available in k8s mode. It could be a good area to
work on. I want to take a look at it. I personally refer to the official Spark
documentation for reference.
Thanks,



On Fri, Jul 10, 2020, 9:30 PM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Hi Prashant,
>
>
>
> It sounds encouraging. During scale down of the cluster, probably few of
> the spark jobs are impacted due to re-computation of shuffle data. This is
> not of supreme importance for us for now.
>
> Is there any reference deployment architecture available, which is HA ,
> scalable and dynamic-allocation-enabled for deploying Spark on K8s? Any
> suggested github repo or link?
>
>
>
> Thanks,
>
> Vaibhav V
>
>
>
>
>
> *From:* Prashant Sharma 
> *Sent:* Friday, July 10, 2020 12:57 AM
> *To:* user@spark.apache.org
> *Cc:* Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>; Varshney, Vaibhav (DI SW CAS MP AFC ARC) <
> vaibhav.varsh...@siemens.com>
> *Subject:* Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
>
>
> Hi,
>
>
>
> Whether it is a blocker or not, is upto you to decide. But, spark k8s
> cluster supports dynamic allocation, through a different mechanism, that
> is, without using an external shuffle service.
> https://issues.apache.org/jira/browse/SPARK-27963. There are pros and
> cons of both approaches. The only disadvantage of scaling without external
> shuffle service is, when the cluster scales down or it loses executors due
> to some external cause ( for example losing spot instances), we lose the
> shuffle data (data that was computed as an intermediate to some overall
> computation) on that executor. This situation may not lead to data loss, as
> spark can recompute the lost shuffle data.
>
>
>
> Dynamic scaling, up and down, is helpful when the spark
> cluster is running off "spot instances on AWS", for example, or when the
> size of data is not known in advance. In other words, we cannot estimate
> how many resources would be needed to process the data. Dynamic scaling
> lets the cluster increase its size based only on the number of pending
> tasks; currently this is the only metric implemented.
>
>
>
> I don't think it is a blocker for my production use cases.
>
>
>
> Thanks,
>
> Prashant
>
>
>
> On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
>
> Thanks for response. We have tried it in dev env. For production, if Spark
> 3.0 is not leveraging k8s scheduler, then would Spark Cluster in K8s be
> "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>
>


RE: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-10 Thread Varshney, Vaibhav
Hi Prashant,

It sounds encouraging. During scale-down of the cluster, a few of the Spark
jobs are probably impacted by re-computation of shuffle data. This is not of
supreme importance for us for now.
Is there any reference deployment architecture available which is HA,
scalable and dynamic-allocation-enabled for deploying Spark on K8s? Any
suggested GitHub repo or link?

Thanks,
Vaibhav V


From: Prashant Sharma 
Sent: Friday, July 10, 2020 12:57 AM
To: user@spark.apache.org
Cc: Sean Owen ; Ramani, Sai (DI SW CAS MP AFC ARC) 
; Varshney, Vaibhav (DI SW CAS MP AFC ARC) 

Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

Hi,

Whether it is a blocker or not, is upto you to decide. But, spark k8s cluster 
supports dynamic allocation, through a different mechanism, that is, without 
using an external shuffle service. 
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons of 
both approaches. The only disadvantage of scaling without external shuffle 
service is, when the cluster scales down or it loses executors due to some 
external cause ( for example losing spot instances), we lose the shuffle data 
(data that was computed as an intermediate to some overall computation) on that 
executor. This situation may not lead to data loss, as spark can recompute the 
lost shuffle data.

Dynamically, scaling up and down scaling, is helpful when the spark cluster is 
running off, "spot instances on AWS" for example or when the size of data is 
not known in advance. In other words, we cannot estimate how much resources 
would be needed to process the data. Dynamic scaling, lets the cluster increase 
its size only based on the number of pending tasks, currently this is the only 
metric implemented.

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav 
mailto:vaibhav.varsh...@siemens.com>> wrote:
Thanks for response. We have tried it in dev env. For production, if Spark 3.0 
is not leveraging k8s scheduler, then would Spark Cluster in K8s be "static"?
As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is still 
blocker for production workloads?

Thanks,
Vaibhav V

-Original Message-
From: Sean Owen mailto:sro...@gmail.com>>
Sent: Thursday, July 9, 2020 3:20 PM
To: Varshney, Vaibhav (DI SW CAS MP AFC ARC) 
mailto:vaibhav.varsh...@siemens.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>; Ramani, Sai (DI SW CAS 
MP AFC ARC) mailto:sai.ram...@siemens.com>>
Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

I haven't used the K8S scheduler personally, but, just based on that comment I 
wouldn't worry too much. It's been around for several versions and AFAIK works 
fine in general. We sometimes aren't so great about removing "experimental" 
labels. That said I know there are still some things that could be added to it 
and more work going on, and maybe people closer to that work can comment. But 
yeah you shouldn't be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav 
mailto:vaibhav.varsh...@siemens.com>> wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Prashant Sharma
Hi,

Whether it is a blocker or not is up to you to decide. But a Spark k8s
cluster supports dynamic allocation through a different mechanism, that is,
without using an external shuffle service:
https://issues.apache.org/jira/browse/SPARK-27963. There are pros and cons
to both approaches. The only disadvantage of scaling without an external
shuffle service is that when the cluster scales down, or loses executors due
to some external cause (for example, losing spot instances), we lose the
shuffle data (data computed as an intermediate step of some overall
computation) on those executors. This situation may not lead to data loss,
as Spark can recompute the lost shuffle data.
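
For illustration only, a minimal sketch of the flags involved (Spark 3.0+;
the values are just examples):

    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=10

With shuffle tracking enabled, Spark tries to keep executors that hold
shuffle data for active jobs alive, instead of relying on an external
shuffle service.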

Dynamic scaling, up and down, is helpful when the Spark cluster is running
off spot instances on AWS, for example, or when the size of the data is not
known in advance. In other words, we cannot estimate how many resources
would be needed to process the data. Dynamic scaling lets the cluster
increase its size based only on the number of pending tasks; currently this
is the only metric implemented.

I don't think it is a blocker for my production use cases.

Thanks,
Prashant

On Fri, Jul 10, 2020 at 2:06 AM Varshney, Vaibhav <
vaibhav.varsh...@siemens.com> wrote:

> Thanks for response. We have tried it in dev env. For production, if Spark
> 3.0 is not leveraging k8s scheduler, then would Spark Cluster in K8s be
> "static"?
> As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is
> still blocker for production workloads?
>
> Thanks,
> Vaibhav V
>
> -Original Message-
> From: Sean Owen 
> Sent: Thursday, July 9, 2020 3:20 PM
> To: Varshney, Vaibhav (DI SW CAS MP AFC ARC)  >
> Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) <
> sai.ram...@siemens.com>
> Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production
> deployment
>
> I haven't used the K8S scheduler personally, but, just based on that
> comment I wouldn't worry too much. It's been around for several versions
> and AFAIK works fine in general. We sometimes aren't so great about
> removing "experimental" labels. That said I know there are still some
> things that could be added to it and more work going on, and maybe people
> closer to that work can comment. But yeah you shouldn't be afraid to try it.
>
> On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav <
> vaibhav.varsh...@siemens.com> wrote:
> >
> > Hi Spark Experts,
> >
> >
> >
> > We are trying to deploy spark on Kubernetes.
> >
> > As per doc
> http://spark.apache.org/docs/latest/running-on-kubernetes.html, it looks
> like K8s deployment is experimental.
> >
> > "The Kubernetes scheduler is currently experimental ".
> >
> >
> >
> > Spark 3.0 does not support production deployment using k8s scheduler?
> >
> > What’s the plan on full support of K8s scheduler?
> >
> >
> >
> > Thanks,
> >
> > Vaibhav V
>


RE: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Varshney, Vaibhav
Thanks for the response. We have tried it in a dev environment. For
production, if Spark 3.0 is not leveraging the k8s scheduler, would the Spark
cluster in K8s be "static"?
As per https://issues.apache.org/jira/browse/SPARK-24432 it seems it is still
a blocker for production workloads?

Thanks,
Vaibhav V

-Original Message-
From: Sean Owen  
Sent: Thursday, July 9, 2020 3:20 PM
To: Varshney, Vaibhav (DI SW CAS MP AFC ARC) 
Cc: user@spark.apache.org; Ramani, Sai (DI SW CAS MP AFC ARC) 

Subject: Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

I haven't used the K8S scheduler personally, but, just based on that comment I 
wouldn't worry too much. It's been around for several versions and AFAIK works 
fine in general. We sometimes aren't so great about removing "experimental" 
labels. That said I know there are still some things that could be added to it 
and more work going on, and maybe people closer to that work can comment. But 
yeah you shouldn't be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav  
wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V


Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Sean Owen
I haven't used the K8S scheduler personally, but, just based on that
comment I wouldn't worry too much. It's been around for several
versions and AFAIK works fine in general. We sometimes aren't so great
about removing "experimental" labels. That said I know there are still
some things that could be added to it and more work going on, and
maybe people closer to that work can comment. But yeah you shouldn't
be afraid to try it.

On Thu, Jul 9, 2020 at 3:18 PM Varshney, Vaibhav
 wrote:
>
> Hi Spark Experts,
>
>
>
> We are trying to deploy spark on Kubernetes.
>
> As per doc http://spark.apache.org/docs/latest/running-on-kubernetes.html, it 
> looks like K8s deployment is experimental.
>
> "The Kubernetes scheduler is currently experimental ".
>
>
>
> Spark 3.0 does not support production deployment using k8s scheduler?
>
> What’s the plan on full support of K8s scheduler?
>
>
>
> Thanks,
>
> Vaibhav V




Re: Spark on kubernetes : missing spark.kubernetes.driver.request.cores parameter ?

2019-10-04 Thread jcdauchy
I am actually answering myself, as I have checked the master 3.x branch, and
this feature is there!

https://issues.apache.org/jira/browse/SPARK-27754

So my understanding was correct.
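
For illustration, a sketch of how the new option can be passed on Spark 3.x
(the values are only examples):

    --conf spark.kubernetes.driver.request.cores=0.5 \
    --conf spark.kubernetes.driver.limit.cores=1

This decouples the driver pod's CPU request from spark.driver.cores,
mirroring what spark.kubernetes.executor.request.cores already does for
executors.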







RE: Spark on Kubernetes - log4j.properties not read

2019-06-11 Thread Dave Jaffe
That did the trick, Abhishek! Thanks for the explanation, that answered a lot
of questions I had.

Dave






RE: Spark on Kubernetes - log4j.properties not read

2019-06-10 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Dave,

As part of the driver pod bring-up, a ConfigMap is created from all the Spark
configuration parameters (with the name spark.properties) and mounted to
/opt/spark/conf, so any other files present in /opt/spark/conf will be
overwritten. The same is happening with log4j.properties in this case. You
could try building the container with log4j.properties at some other location
and setting that path in spark.driver.extraJavaOptions.
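
For example, a rough sketch assuming log4j.properties is baked into the image
at a hypothetical /opt/spark/log4j/ directory (any path other than
/opt/spark/conf should work); the second line does the same for the executors:

    --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties \
    --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties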

Thanks and Regards,
Abhishek

From: Dave Jaffe 
Sent: Tuesday, June 11, 2019 6:45 AM
To: user@spark.apache.org
Subject: Spark on Kubernetes - log4j.properties not read

I am using Spark on Kubernetes from Spark 2.4.3. I have created a 
log4j.properties file in my local spark/conf directory and modified it so that 
the console (or, in the case of Kubernetes, the log) only shows warnings and 
higher (log4j.rootCategory=WARN, console). I then added the command
COPY conf /opt/spark/conf
to /root/spark/kubernetes/dockerfiles/spark/Dockerfile and built a new 
container.

However, when I run that under Kubernetes, the program runs successfully but 
/opt/spark/conf/log4j.properties is not used (I still see the INFO lines when I 
run kubectl logs ).

I have tried other things such as explicitly adding a --properties-file to my
spark-submit command and even
--conf 
spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties

My log4j.properties file is never seen.

How do I customize log4j.properties with Kubernetes?

Thanks, Dave Jaffe



Re: Spark with Kubernetes connecting to pod ID, not address

2019-02-13 Thread Pat Ferrel
solve(SimpleNameResolver.java:55)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at 
io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at 
io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at 
io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at 
io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more



From: Erik Erlandson 
Date: February 13, 2019 at 4:57:30 AM
To: Pat Ferrel 
Subject:  Re: Spark with Kubernetes connecting to pod id, not address  

Hi Pat,

I'd suggest visiting the big data Slack channel; it's a more Spark-oriented
forum than kube-dev:
https://kubernetes.slack.com/messages/C0ELB338T/

Tentatively, I think you may want to submit in client mode (unless you are 
initiating your application from outside the kube cluster). When in client 
mode, you need to set up a headless service for the application driver pod that 
the executors can use to talk back to the driver.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode
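
For reference, a rough sketch of the client-mode settings described there,
assuming a headless service named spark-driver-headless has been created for
the driver pod in the default namespace (both names are made up here):

    --deploy-mode client \
    --conf spark.driver.host=spark-driver-headless.default.svc.cluster.local \
    --conf spark.driver.port=7078 \
    --conf spark.driver.blockManager.port=7079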

Cheers,
Erik


On Wed, Feb 13, 2019 at 1:55 AM Pat Ferrel  wrote:
We have a k8s deployment of several services including Apache Spark. All 
services seem to be operational. Our application connects to the Spark master 
to submit a job using the k8s DNS service for the cluster where the master is 
called spark-api so we use master=spark://spark-api:7077 and we use 
spark.submit.deployMode=cluster. We submit the job through the API not by the 
spark-submit script. 

This runs the "driver" and all "executors" on the cluster, and that part
seems to work, but there is a callback from some Spark process to the
launching code in our app. For some reason it is trying to connect to
harness-64d97d6d6-4r4d8, which is the pod ID, not the k8s cluster IP or DNS.

How could this pod ID be getting into the system? Spark somehow seems to think 
it is the address of the service that called it. Needless to say any connection 
to the k8s pod ID fails and so does the job.

Any idea how Spark could think the pod ID is an IP address or DNS name? 

BTW if we run a small sample job with `master=local` all is well, but the same 
job executed with the above config tries to connect to the spurious pod ID.


Re: [Spark for kubernetes] Azure Blob Storage credentials issue

2018-10-24 Thread Matt Cheah
Hi there,

 

Can you check if HADOOP_CONF_DIR is being set on the executors to 
/opt/spark/conf? One should set an executor environment variable for that.
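
For illustration, a sketch of one way to do that with the standard per-pod
environment settings (the path simply mirrors the secret mount already used
in the submit command):

    --conf spark.executorEnv.HADOOP_CONF_DIR=/opt/spark/conf \
    --conf spark.kubernetes.driverEnv.HADOOP_CONF_DIR=/opt/spark/conf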

 

A kubectl describe pod output for the executors would be helpful here.

 

-Matt Cheah

 

From: Oscar Bonilla 
Date: Friday, October 19, 2018 at 1:03 AM
To: "user@spark.apache.org" 
Subject: [Spark for kubernetes] Azure Blob Storage credentials issue

 

Hello,

I'm having the following issue while trying to run Spark for Kubernetes:
2018-10-18 08:48:54 INFO  DAGScheduler:54 - Job 0 failed: reduce at 
SparkPi.scala:38, took 1.743177 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost 
task 0.3 in stage 0.0 (TID 6, 10.244.1.11, executor 2): 
org.apache.hadoop.fs.azure.AzureException: 
org.apache.hadoop.fs.azure.AzureException: No credentials found for account 
datasets83d858296fd0c49b.blob.core.windows.net in the configuration, and its 
container datasets is not accessible using anonymous credentials. Please check 
if the container exists first. If it is not publicly available, you have to 
provide account credentials.
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1086)
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:538)
    at 
org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1366)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3242)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
    at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1897)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:694)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:476)
    at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
    at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
    at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.azure.AzureException: No credentials found for 
account datasets83d858296fd0c49b.blob.core.windows.net in the configuration, and its 
container datasets is not accessible using anonymous credentials. Please check 
if the container exists first. If it is not publicly available, you have to 
provide account credentials.
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:863)
    at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1081)
    ... 24 more
The command I use to launch the job is:
/opt/spark/bin/spark-submit
    --master k8s://
    --deploy-mode cluster
    --name spark-pi
    --class org.apache.spark.examples.SparkPi
    --conf spark.executor.instances=5
    --conf spark.kubernetes.container.image=
    --conf spark.kubernetes.namespace=
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
    --conf spark.kubernetes.driver.secrets.spark=/opt/spark/conf
    --conf spark.kubernetes.executor.secrets.spark=/opt/spark/conf
wasb://@.blob.core.windows.net/spark-examples_2.11-2.3.2.jar
I have a k8s secret named spark with the following content:
apiVersion: v1
kind: Secret
metadata:
  name: spark
  labels:
    app: spark
    stack: service
type: Opaque
data:
  core-site.xml: |-
    {% filter b64encode %}
    
    
  

Re: Spark on Kubernetes: Kubernetes killing executors because of overallocation of memory

2018-08-02 Thread Matt Cheah
Hi there,

 

You may want to look at setting the memory overhead settings higher. Spark will 
then start containers with a higher memory limit (spark.executor.memory + 
spark.executor.memoryOverhead, to be exact) while the heap is still locked to 
spark.executor.memory. There’s some memory used by offheap storage from Spark 
that won’t be accounted for in just the heap size.
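
For illustration, a sketch with example values for the 7g executors discussed
below; the pod memory limit then becomes roughly 8g:

    --conf spark.executor.memory=7g \
    --conf spark.executor.memoryOverhead=1g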

 

Hope this helps,

 

-Matt Cheah

 

From: Jayesh Lalwani 
Date: Thursday, August 2, 2018 at 12:35 PM
To: "user@spark.apache.org" 
Subject: Spark on Kubernetes: Kubernetes killing executors because of 
overallocation of memory

 

We are running Spark 2.3 on a Kubernetes cluster. We have set the following 
spark configuration options

"spark.executor.memory": "7g",

"spark.driver.memory": "2g",

"spark.memory.fraction": "0.75"

 

What we see is:

a) In the Spark UI, 5G has been allocated to each executor, which makes sense
because we set spark.memory.fraction=0.75
b) Kubernetes reports the pod memory usage as 7.6G

 

When we run a lot of jobs on the Kubernetes cluster, Kubernetes starts killing
the executor pods because it thinks they are misbehaving.

 

We logged into a running pod, and ran the top command, and most of the 7.6G is 
being allocated to the executor's java process

 

Why is Spark taking 7.6G instead of 7G? Where is the extra 600MB being
allocated? Is there some configuration that controls how much of the executor
memory gets allocated to PermGen versus the heap?

 

 

 






Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Marcelo Vanzin
This is the problem:

> :/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

Seems like some code is confusing things when mixing OSes. It's using
the Windows separator when building a command line to be run on a
Linux host.


On Tue, Apr 10, 2018 at 11:02 AM, Dmitry  wrote:
> Previous example was bad paste( I tried a lot of variants, so sorry for
> wrong paste )
> PS C:\WINDOWS\system32> spark-submit --master k8s://https://ip:8443
> --deploy-mode cluster  --name spark-pi --class
> org.apache.spark.examples.SparkPi --conf spark.executor.instances=1
> --executor-memory 1G --conf spark.kubernete
> s.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
> local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> Returns
> Image:
> andrusha/spark-k8s:2.3.0-hadoop2.7
> Environment variables:
> SPARK_DRIVER_MEMORY: 1g
> SPARK_DRIVER_CLASS: org.apache.spark.examples.SparkPi
> SPARK_DRIVER_ARGS:
> SPARK_DRIVER_BIND_ADDRESS:
> SPARK_MOUNTED_CLASSPATH:
> /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> SPARK_JAVA_OPT_0:
> -Dspark.kubernetes.driver.pod.name=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver
> SPARK_JAVA_OPT_1:
> -Dspark.kubernetes.executor.podNamePrefix=spark-pi-46f48a0974d43341886076bc3c5f31c4
> SPARK_JAVA_OPT_2: -Dspark.app.name=spark-pi
> SPARK_JAVA_OPT_3:
> -Dspark.driver.host=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver-svc.default.svc
> SPARK_JAVA_OPT_4: -Dspark.submit.deployMode=cluster
> SPARK_JAVA_OPT_5: -Dspark.driver.blockManager.port=7079
> SPARK_JAVA_OPT_6: -Dspark.master=k8s://https://ip:8443
> SPARK_JAVA_OPT_7:
> -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> SPARK_JAVA_OPT_8:
> -Dspark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
> SPARK_JAVA_OPT_9: -Dspark.executor.instances=1
> SPARK_JAVA_OPT_10: -Dspark.app.id=spark-16eb67d8953e418aba96c2d12deecd11
> SPARK_JAVA_OPT_11: -Dspark.executor.memory=1G
> SPARK_JAVA_OPT_12: -Dspark.driver.port=7078
>
>
> -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS
> $SPARK_DRIVER_ARGS)
> + exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java
> -Dspark.app.id=spark-16eb67d8953e418aba96c2d12deecd11
> -Dspark.executor.memory=1G -Dspark.driver.port=7078
> -Dspark.driver.blockManager.port=7079 -Dspark.submit.deployMode=cluster
> -Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
> -Dspark.master=k8s://https://172.20.10.12:8443
> -Dspark.kubernetes.executor.podNamePrefix=spark-pi-46f48a0974d43341886076bc3c5f31c4
> -Dspark.kubernetes.driver.pod.name=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver
> -Dspark.driver.host=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver-svc.default.svc
> -Dspark.app.name=spark-pi -Dspark.executor.instances=1
> -Dspark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7 -cp
> ':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
> -Xms1g -Xmx1g -Dspark.driver.bindAddress=172.17.0.2
> org.apache.spark.examples.SparkPi
> Error: Could not find or load main class org.apache.spark.examples.SparkPi
>
> Found this stackoverflow question
> https://stackoverflow.com/questions/49331570/spark-2-3-minikube-kubernetes-windows-demo-sparkpi-not-found
> but there is no answer.
> I also checked container file system, it contains
> /opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
>
>
> 2018-04-11 1:17 GMT+08:00 Yinan Li :
>>
>> The example jar path should be
>> local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar.
>>
>> On Tue, Apr 10, 2018 at 1:34 AM, Dmitry  wrote:
>>>
>>> Hello spent a lot of time to find what I did wrong , but not found.
>>> I have a minikube WIndows based cluster ( Hyper V as hypervisor ) and try
>>> to run examples against Spark 2.3. Tried several  docker images builds:
>>> * several  builds that I build myself
>>> * andrusha/spark-k8s:2.3.0-hadoop2.7 from docker  hub
>>> But when I try to submit job driver log returns  class not found
>>> exception
>>> org.apache.spark.examples.SparkPi
>>>
>>> spark-submit --master k8s://https://ip:8443  --deploy-mode cluster
>>> --name spark-pi --class org.apache.spark.examples.SparkPi --conf
>>> spark.executor.instances=1 --executor-memory 1G --conf spark.kubernete
>>> s.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
>>> local:///opt/spark/examples/spark-examples_2.11-2.3.0.jar
>>>
>>> I tried to use https://github.com/apache-spark-on-k8s/spark fork and it
>>> is works without problems, more complex examples work also.
>>
>>
>



-- 
Marcelo




Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Dmitry
Previous example was a bad paste (I tried a lot of variants, so sorry for the
wrong paste).
PS C:\WINDOWS\system32> spark-submit --master k8s://https://ip:8443
--deploy-mode cluster  --name spark-pi --class
org.apache.spark.examples.SparkPi
--conf spark.executor.instances=1 --executor-memory 1G --conf
spark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
Returns
Image:
andrusha/spark-k8s:2.3.0-hadoop2.7
Environment variables:
SPARK_DRIVER_MEMORY: 1g
SPARK_DRIVER_CLASS: org.apache.spark.examples.SparkPi
SPARK_DRIVER_ARGS:
SPARK_DRIVER_BIND_ADDRESS:
SPARK_MOUNTED_CLASSPATH:
/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
SPARK_JAVA_OPT_0: -Dspark.kubernetes.driver.pod.name=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver
SPARK_JAVA_OPT_1:
-Dspark.kubernetes.executor.podNamePrefix=spark-pi-46f48a0974d43341886076bc3c5f31c4
SPARK_JAVA_OPT_2: -Dspark.app.name=spark-pi
SPARK_JAVA_OPT_3:
-Dspark.driver.host=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver-svc.default.svc
SPARK_JAVA_OPT_4: -Dspark.submit.deployMode=cluster
SPARK_JAVA_OPT_5: -Dspark.driver.blockManager.port=7079
SPARK_JAVA_OPT_6: -Dspark.master=k8s://https://ip:8443
SPARK_JAVA_OPT_7:
-Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
SPARK_JAVA_OPT_8:
-Dspark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
SPARK_JAVA_OPT_9: -Dspark.executor.instances=1
SPARK_JAVA_OPT_10: -Dspark.app.id=spark-16eb67d8953e418aba96c2d12deecd11
SPARK_JAVA_OPT_11: -Dspark.executor.memory=1G
SPARK_JAVA_OPT_12: -Dspark.driver.port=7078


-Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS
$SPARK_DRIVER_ARGS)
+ exec /sbin/tini -s -- /usr/lib/jvm/java-1.8-openjdk/bin/java -Dspark.app.id=spark-16eb67d8953e418aba96c2d12deecd11
-Dspark.executor.memory=1G -Dspark.driver.port=7078
-Dspark.driver.blockManager.port=7079 -Dspark.submit.deployMode=cluster
-Dspark.jars=/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar,/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
-Dspark.master=k8s://https://172.20.10.12:8443
-Dspark.kubernetes.executor.podNamePrefix=spark-pi-46f48a0974d43341886076bc3c5f31c4
-Dspark.kubernetes.driver.pod.name=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver
-Dspark.driver.host=spark-pi-46f48a0974d43341886076bc3c5f31c4-driver-svc.default.svc
-Dspark.app.name=spark-pi -Dspark.executor.instances=1
-Dspark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7 -cp
':/opt/spark/jars/*:/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar;/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar'
-Xms1g -Xmx1g -Dspark.driver.bindAddress=172.17.0.2
org.apache.spark.examples.SparkPi
Error: Could not find or load main class org.apache.spark.examples.SparkPi

Found this stackoverflow question
https://stackoverflow.com/questions/49331570/spark-2-3-minikube-kubernetes-windows-demo-sparkpi-not-found
but there is no answer.
I also checked container file system, it contains
/opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar



2018-04-11 1:17 GMT+08:00 Yinan Li :

> The example jar path should be local:///opt/spark/examples/*jars*
> /spark-examples_2.11-2.3.0.jar.
>
> On Tue, Apr 10, 2018 at 1:34 AM, Dmitry  wrote:
>
>> Hello spent a lot of time to find what I did wrong , but not found.
>> I have a minikube WIndows based cluster ( Hyper V as hypervisor ) and try
>> to run examples against Spark 2.3. Tried several  docker images builds:
>> * several  builds that I build myself
>> * andrusha/spark-k8s:2.3.0-hadoop2.7 from docker  hub
>> But when I try to submit job driver log returns  class not found exception
>> org.apache.spark.examples.SparkPi
>>
>> spark-submit --master k8s://https://ip:8443  --deploy-mode cluster
>> --name spark-pi --class org.apache.spark.examples.SparkPi --conf
>> spark.executor.instances=1 --executor-memory 1G --conf spark.kubernete
>> s.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
>> local:///opt/spark/examples/spark-examples_2.11-2.3.0.jar
>>
>> I tried to use https://github.com/apache-spark-on-k8s/spark fork and it
>> is works without problems, more complex examples work also.
>>
>
>


Re: Spark on Kubernetes (minikube) 2.3 fails with class not found exception

2018-04-10 Thread Yinan Li
The example jar path should be local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar.
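
For example, the command from the original post with only the jar path
corrected would look roughly like this:

    spark-submit --master k8s://https://ip:8443 --deploy-mode cluster \
      --name spark-pi --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=1 --executor-memory 1G \
      --conf spark.kubernetes.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar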

On Tue, Apr 10, 2018 at 1:34 AM, Dmitry  wrote:

> Hello spent a lot of time to find what I did wrong , but not found.
> I have a minikube WIndows based cluster ( Hyper V as hypervisor ) and try
> to run examples against Spark 2.3. Tried several  docker images builds:
> * several  builds that I build myself
> * andrusha/spark-k8s:2.3.0-hadoop2.7 from docker  hub
> But when I try to submit job driver log returns  class not found exception
> org.apache.spark.examples.SparkPi
>
> spark-submit --master k8s://https://ip:8443  --deploy-mode cluster
> --name spark-pi --class org.apache.spark.examples.SparkPi --conf
> spark.executor.instances=1 --executor-memory 1G --conf spark.kubernete
> s.container.image=andrusha/spark-k8s:2.3.0-hadoop2.7
> local:///opt/spark/examples/spark-examples_2.11-2.3.0.jar
>
> I tried to use https://github.com/apache-spark-on-k8s/spark fork and it
> is works without problems, more complex examples work also.
>