Re: Announcing Delta Lake 0.2.0

2019-06-20 Thread Li Gao
Lyft recently open-sourced a data discovery tool called Amundsen that can
serve many data catalog needs.

https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9
https://github.com/lyft/amundsenmetadatalibrary

You still need HMS (the Hive Metastore) to store the data schema, though.



On Thu, Jun 20, 2019 at 4:47 AM James Cotrotsios 
wrote:

> Is there a plan to have a business catalog component for the Data Lake? If
> not, how would someone make a proposal to create an open source project
> related to that? I would be interested in building out an open source data
> catalog that would use the Hive metastore as a baseline for technical
> metadata.
>
>
> On Wed, Jun 19, 2019 at 3:04 PM Liwen Sun 
> wrote:
>
>> We are delighted to announce the availability of Delta Lake 0.2.0!
>>
>> To try out Delta Lake 0.2.0, please follow the Delta Lake Quickstart:
>> https://docs.delta.io/0.2.0/quick-start.html
>>
>> To view the release notes:
>> https://github.com/delta-io/delta/releases/tag/v0.2.0
>>
>> This release introduces two main features:
>>
>> *Cloud storage support*
>> In addition to HDFS, you can now configure Delta Lake to read and write
>> data on cloud storage services such as Amazon S3 and Azure Blob Storage.
>> For configuration instructions, please see:
>> https://docs.delta.io/0.2.0/delta-storage.html
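
For illustration (not from the release notes): a spark-shell session could be
pointed at S3 roughly like this, assuming the delta-core 0.2.0 package, the
hadoop-aws/S3A connector on the classpath, and the S3SingleDriverLogStore
class described in the storage doc above; bucket and credentials are
placeholders.

    # launch with the Delta package and the S3-specific LogStore
    bin/spark-shell \
      --packages io.delta:delta-core_2.11:0.2.0 \
      --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
      --conf spark.hadoop.fs.s3a.access.key=<access-key> \
      --conf spark.hadoop.fs.s3a.secret.key=<secret-key>
    # inside the shell, reads/writes then target an s3a:// path, e.g.
    #   spark.range(5).write.format("delta").save("s3a://<bucket>/delta/events")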
>>
>> *Improved concurrency*
>> Delta Lake now allows concurrent append-only writes while still ensuring
>> serializability. For concurrency control in Delta Lake, please see:
>> https://docs.delta.io/0.2.0/delta-concurrency.html
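
As a small illustration of what this enables (class name, jar and table path
below are purely made up), two independent append-only jobs can now write to
the same Delta table at the same time:

    # two append-only writers against the same table path, run concurrently
    spark-submit --class com.example.AppendEvents app.jar s3a://<bucket>/delta/events &
    spark-submit --class com.example.AppendEvents app.jar s3a://<bucket>/delta/events &
    wait   # both commits should succeed under the new concurrency control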
>>
>> We have also greatly expanded the test coverage as part of this release.
>>
>> We would like to acknowledge all community members for contributing to
>> this release.
>>
>> Best regards,
>> Liwen Sun
>>
>>


Re: Spark 2.4.1 on Kubernetes - DNS resolution of driver fails

2019-05-02 Thread Li Gao
hi Olivier,

This seems like a GKE-specific issue. Have you tried other vendors? Also, on
the kubelet nodes, did you notice any pressure on the DNS side?

Li


On Mon, Apr 29, 2019, 5:43 AM Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> I have ~300 Spark jobs on Kubernetes (GKE) using the cluster auto-scaler,
> and sometimes while running these jobs a pretty bad thing happens: the
> driver (in cluster mode) gets scheduled on Kubernetes and launches many
> executor pods.
> So far so good, but the k8s "Service" associated with the driver does not
> seem to be propagated in terms of DNS resolution, so all the executors fail
> because "spark-application-..cluster.svc.local" does not exist.
>
> With all executors failing, the driver should fail too, but it considers
> this a "pending" initial allocation and stays stuck forever in a loop of
> "Initial job has not accepted any resources, please check Cluster UI".
>
> Has anyone else observed this kind of behaviour?
> We had it on 2.3.1, and I upgraded to 2.4.1, but this issue still seems to
> exist even after the "big refactoring" of the Kubernetes cluster scheduler
> backend.
>
> I can work on a fix / workaround, but I'd like to check with you on the
> proper way forward:
>
>    - Some processes (like the airflow helm recipe) rely on a "sleep 30s"
>    before launching the dependent pods (that could be added to
>    /opt/entrypoint.sh used in the kubernetes packaging)
>    - We could add a simple step to the init container that tries the DNS
>    resolution and fails after 60s if it does not work (see the sketch below)
>
> But these steps won't change the fact that the driver will stay stuck,
> thinking we're still within the initial allocation delay.
>
> Thoughts ?
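
A minimal sketch of that second idea (the init-container DNS check), assuming
the driver Service name is handed to the init container in a hypothetical
SPARK_DRIVER_SVC environment variable:

    # wait up to 60s for the driver Service DNS record to become resolvable
    for i in $(seq 1 60); do
      nslookup "$SPARK_DRIVER_SVC" >/dev/null 2>&1 && exit 0
      sleep 1
    done
    echo "driver service DNS never resolved, giving up" >&2
    exit 1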
>
> --
> *Olivier Girardot*
> o.girar...@lateral-thoughts.com
>


Re: Difference between 'cores' config params: spark submit on k8s

2019-04-20 Thread Li Gao
hi Battini,

The limit is a k8s construct that tells k8s how many CPU cores your driver
*can* consume.

When you set the same value for 'spark.driver.cores' and
'spark.kubernetes.driver.limit.cores', your driver runs in the 'Guaranteed'
k8s Quality of Service class, which makes it less likely to be evicted by the
scheduler.

The same goes with the executor settings.

https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

The Guaranteed QoS class is important when you run a multi-tenant k8s cluster
in production.
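
For example, a submission where the CPU requests and limits line up might
look like the sketch below (image, master URL, class and values are
placeholders; Spark already sets the pod memory request and limit to the
same value, so matching the cores settings is the remaining piece):

    spark-submit \
      --master k8s://https://<api-server> \
      --deploy-mode cluster \
      --conf spark.kubernetes.container.image=<spark-image> \
      --conf spark.driver.cores=1 \
      --conf spark.kubernetes.driver.limit.cores=1 \
      --conf spark.executor.cores=2 \
      --conf spark.kubernetes.executor.limit.cores=2 \
      --class <main-class> <app-jar>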

Cheers,
Li


On Thu, Mar 7, 2019, 1:53 PM Battini Lakshman 
wrote:

> Hello,
>
> I understand we need to specify the 'spark.kubernetes.driver.limit.cores'
> and 'spark.kubernetes.executor.limit.cores' config parameters while
> submitting spark on k8s namespace with resource quota applied.
>
> There are also other config parameters, 'spark.driver.cores' and
> 'spark.executor.cores', mentioned in the documentation. What is the
> difference between 'spark.driver.cores' and
> 'spark.kubernetes.driver.limit.cores', please?
>
> Thanks!
>
> Best Regards,
> Lakshman B.
>


Re: Spark Kubernetes Architecture: Deployments vs Pods that create Pods

2019-01-30 Thread Li Gao
Hi Wilson,

As Yinan said, batch jobs with dynamic scaling requirements and
driver-executor communication do not fit into the service-oriented Deployment
paradigm of k8s. Hence the need to capture these Spark-specific differences in
a k8s CRD and a CRD controller that manage the lifecycle of Spark batch jobs
on k8s: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator. The CRD
makes Spark jobs more k8s-compliant and repeatable.

As you discovered, a Deployment is typically used for job-server-style
services.

-Li


On Tue, Jan 29, 2019 at 1:49 PM Yinan Li  wrote:

> Hi Wilson,
>
> The behavior of a Deployment doesn't fit with the way Spark executor pods
> are run and managed. For example, executor pods are created and deleted per
> the requests from the driver dynamically and normally they run to
> completion. A Deployment assumes uniformity and statelessness of the set of
> Pods it manages, which is not necessarily the case for Spark executors. For
> example, executor Pods have unique executor IDs. Dynamic resource
> allocation doesn't play well with a Deployment as scaling or shrinking the
> number of executor Pods requires a rolling update with a Deployment, which
> means restarting all the executor Pods. In Kubernetes mode, the driver is
> effectively a custom controller of executor Pods: it adds or deletes Pods as
> needed and watches their status.
>
> The way Flink on Kubernetes works, as you said, is basically running the
> Flink job/task managers using Deployments. An equivalent is running a
> standalone Spark cluster on top of Kubernetes. If you want auto-restart for
> Spark streaming jobs, I would suggest you take a look at the K8s Spark
> Operator .
>
> On Tue, Jan 29, 2019 at 5:53 AM WILSON Frank <
> frank.wil...@uk.thalesgroup.com> wrote:
>
>> Hi,
>>
>>
>>
>> I’ve been playing around with Spark Kubernetes deployments over the past
>> week and I’m curious to know why Spark deploys as a driver pod that creates
>> more worker pods.
>>
>>
>>
>> I’ve read that it’s normal to use Kubernetes Deployments to create a
>> distributed service, so I am wondering why Spark just creates Pods. I
>> suppose the driver program is 'the odd one out', so it doesn't belong in a
>> Deployment or ReplicaSet, but maybe the workers could be a Deployment? Is
>> this something to do with data locality?
>>
>>
>>
>> I haven't tried Streaming pipelines on Kubernetes yet; are these also Pods
>> that create Pods rather than Deployments? It seems more important for a
>> streaming pipeline to be 'durable' [1], as the Kubernetes documentation
>> might say.
>>
>>
>>
>> I ask this question partly because the Kubernetes deployment of Spark is
>> still experimental and I am wondering whether this aspect of the deployment
>> might change.
>>
>>
>>
>> I had a look at the Flink[2] documentation and it does seem to use
>> Deployments; however, these seem to be lightweight job/task managers that
>> accept Flink jobs. It actually sounds like running a lightweight version of
>> YARN inside containers on Kubernetes.
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>> Frank
>>
>>
>>
>> [1]
>> https://kubernetes.io/docs/concepts/workloads/pods/pod/#durability-of-pods-or-lack-thereof
>>
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html
>>
>


Re: Spark UI History server on Kubernetes

2019-01-23 Thread Li Gao
In addition to what Rao mentioned, if you are using cloud blob storage such
as AWS S3, you can specify your history location to be an S3 location such
as:  `s3://mybucket/path/to/history`
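
A rough sketch of that setup, using the s3a:// scheme of the hadoop-aws
connector (bucket/path are placeholders, and the S3 credentials/connector
config is assumed to already be in place):

    # history server side
    export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=s3a://mybucket/path/to/history"
    $SPARK_HOME/sbin/start-history-server.sh

    # job side: write event logs to the same location
    spark-submit \
      --conf spark.eventLog.enabled=true \
      --conf spark.eventLog.dir=s3a://mybucket/path/to/history \
      --class <main-class> <app-jar>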


On Wed, Jan 23, 2019 at 12:55 AM Rao, Abhishek (Nokia - IN/Bangalore) <
abhishek@nokia.com> wrote:

> Hi Lakshman,
>
>
>
> We've set these 2 properties to bring up the Spark history server:
>
>
>
> spark.history.fs.logDirectory 
>
> spark.history.ui.port 
>
>
>
> We're writing the logs to HDFS. In order to write logs, we set the
> following properties while submitting the Spark job:
>
> spark.eventLog.enabled true
>
> spark.eventLog.dir 
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
> *From:* Battini Lakshman 
> *Sent:* Wednesday, January 23, 2019 1:55 PM
> *To:* Rao, Abhishek (Nokia - IN/Bangalore) 
> *Subject:* Re: Spark UI History server on Kubernetes
>
>
>
> HI Abhishek,
>
>
>
> Thank you for your response. Could you please let me know the properties
> you configured for bringing up History Server and its UI.
>
>
>
> Also, are you writing the logs to any directory on persistent storage? If
> yes, could you let me know the changes you made in Spark to write logs to
> that directory. Thanks!
>
>
>
> Best Regards,
>
> Lakshman Battini.
>
>
>
> On Tue, Jan 22, 2019 at 10:53 PM Rao, Abhishek (Nokia - IN/Bangalore) <
> abhishek@nokia.com> wrote:
>
> Hi,
>
>
>
> We've set up the spark-history service (based on Spark 2.4) on K8S. The UI
> works perfectly fine when running on NodePort. We're facing some issues when
> running behind an ingress.
>
> Please let us know what kind of inputs you need.
>
>
>
> Thanks and Regards,
>
> Abhishek
>
>
>
> *From:* Battini Lakshman 
> *Sent:* Tuesday, January 22, 2019 6:02 PM
> *To:* user@spark.apache.org
> *Subject:* Spark UI History server on Kubernetes
>
>
>
> Hello,
>
>
>
> We are running Spark 2.4 on a Kubernetes cluster and are able to access the
> Spark UI using "kubectl port-forward".
>
>
>
> However, this Spark UI only shows currently running Spark applications; we
> would like to retain the logs of 'completed' Spark applications as well.
> Could someone help us set up the 'Spark History Server' on Kubernetes? Thanks!
>
>
>
> Best Regards,
>
> Lakshman Battini.
>
>


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Li Gao
On YARN it is impossible, AFAIK. On Kubernetes you can use taints to keep
certain nodes off-limits to Spark.
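
A small sketch of the taint approach (node name and key/value are
illustrative); Spark pods that do not carry a matching toleration will simply
never be scheduled onto the tainted node:

    # keep Spark (and anything else without a toleration) off this node
    kubectl taint nodes node-to-avoid dedicated=non-spark:NoSchedule
    # remove the taint later with
    kubectl taint nodes node-to-avoid dedicated:NoSchedule-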

On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung 
wrote:

> Not as far as I recall...
>
>
> --
> *From:* Serega Sheypak 
> *Sent:* Friday, January 18, 2019 3:21 PM
> *To:* user
> *Subject:* Spark on Yarn, is it possible to manually blacklist nodes
> before running spark job?
>
> Hi, is there any way to tell the scheduler to blacklist specific nodes
> in advance?
>


[Spark on K8s] Scaling experiences sharing

2018-11-09 Thread Li Gao
Hi Spark Community,

I am reaching out to see if there are any current large-scale production or
pre-production deployments of Spark on k8s for batch and micro-batch jobs.
Large scale here means running 100s of thousands of Spark jobs daily, 1000s of
concurrent Spark jobs on a single k8s cluster, and 10s of millions of Spark
executor pods daily (not concurrently).

If you happen to run and develop Spark on k8s at such scale, I'd like to
learn about your experience, scaling challenges, and solutions.

Thank you,
Li


Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Li Gao
This is wonderful!
I noticed the official Spark download site does not have 2.4 download links
yet.

On Thu, Nov 8, 2018, 4:11 PM Swapnil Shinde  wrote:

> Great news.. thank you very much!
>
> On Thu, Nov 8, 2018, 5:19 PM Stavros Kontopoulos <
> stavros.kontopou...@lightbend.com wrote:
>
>> Awesome!
>>
>> On Thu, Nov 8, 2018 at 9:36 PM, Jules Damji  wrote:
>>
>>> Indeed!
>>>
>>> Sent from my iPhone
>>> Pardon the dumb thumb typos :)
>>>
>>> On Nov 8, 2018, at 11:31 AM, Dongjoon Hyun 
>>> wrote:
>>>
>>> Finally, thank you all. Especially, thanks to the release manager,
>>> Wenchen!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Nov 8, 2018 at 11:24 AM Wenchen Fan  wrote:
>>>
 + user list

 On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan  wrote:

> resend
>
> On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan 
> wrote:
>
>>
>>
>> -- Forwarded message -
>> From: Wenchen Fan 
>> Date: Thu, Nov 8, 2018 at 10:55 PM
>> Subject: [ANNOUNCE] Announcing Apache Spark 2.4.0
>> To: Spark dev list 
>>
>>
>> Hi all,
>>
>> Apache Spark 2.4.0 is the fifth release in the 2.x line. This release
>> adds Barrier Execution Mode for better integration with deep learning
>> frameworks, introduces 30+ built-in and higher-order functions to deal with
>> complex data types more easily, improves the K8s integration, and adds
>> experimental Scala 2.12 support. Other major updates include the built-in
>> Avro data source, the Image data source, flexible streaming sinks,
>> elimination of the 2GB block size limitation during transfer, and Pandas
>> UDF improvements. In addition, this release continues to focus on
>> usability, stability, and polish while resolving around 1100 tickets.
>>
>> We'd like to thank our contributors and users for their contributions
>> and early feedback to this release. This release would not have been
>> possible without you.
>>
>> To download Spark 2.4.0, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>
>> Thanks,
>> Wenchen
>>
>> PS: If you see any issues with the release notes, webpage or
>> published artifacts, please contact me directly off-list.
>>
>
>>
>>
>>
>>


Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Hi Yuqi,

Yes, we are running Jupyter Gateway and kernels on k8s and using Spark 2.4's
client mode to launch pyspark. In client mode, your driver runs in the same
pod as your kernel.

I am planning to write a blog post on this at some future date. Did you
create the headless service that reflects the driver pod name? That's one of
the critical pieces we automated in our custom code to make client mode work.
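
A rough sketch of that piece, with made-up names: the imperative command
below creates a headless Service whose selector is app=<name>, so the driver
pod has to carry that label, and the port is arbitrary.

    # headless service matching the driver pod, in namespace "spark"
    kubectl -n spark create service clusterip jupyter-driver --clusterip=None --tcp=29413:29413
    # the in-cluster client-mode session is then pointed at it, e.g.
    #   --conf spark.driver.host=jupyter-driver.spark.svc.cluster.local
    #   --conf spark.driver.port=29413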

-Li


On Wed, Oct 31, 2018 at 8:13 AM Zhang, Yuqi  wrote:

> Hi Li,
>
>
>
> Thank you for your reply.
>
> Do you mean running the Jupyter client on a k8s cluster to use Spark 2.4?
> Actually, I am also trying to set up JupyterHub on k8s to use Spark; that's
> why I would like to know how to run Spark client mode on a k8s cluster. If
> there is any related documentation on how to set up Jupyter on k8s to use
> Spark, could you please share it with me?
>
>
>
> Thank you for your help!
>
>
>
> Best Regards,
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com <http://www.teradata.com>
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Li Gao 
> *Date: *Thursday, November 1, 2018 0:07
> *To: *"Zhang, Yuqi" 
> *Cc: *"gourav.sengu...@gmail.com" , "
> user@spark.apache.org" , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> Yuqi,
>
>
>
> Your error seems unrelated to the headless service config you need to
> enable. For the headless service, you need to create one that matches your
> driver pod name exactly in order for the Spark 2.4 RC to work in client
> mode. We have had this running for a while now, using a Jupyter kernel as
> the driver client.
>
>
>
> -Li
>
>
>
>
>
> On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi 
> wrote:
>
> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven't tried Glue or EMR, but I guess it's integrating Kubernetes on AWS
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is that I don't know
> how to run spark-shell on Kubernetes…
>
> Since Spark only supports client mode on k8s from version 2.4, which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding how to run spark-shell on a k8s cluster.
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com <http://www.teradata.com>
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Gourav Sengupta 
> *Date: *Wednesday, October 31, 2018 18:34
> *To: *"Zhang, Yuqi" 
> *Cc: *user , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> [External Email]
> --
>
> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb, but I have a problem with
> the Spark 2.4 client mode function on a Kubernetes cluster, so I would like
> to ask if there is a solution to my problem.
>
>
>
> The problem is that when I try to run spark-shell on a Kubernetes v1.11.3
> cluster in an AWS environment, I can't successfully run a stateful set using
> the Docker image built from Spark 2.4. The error message is shown below.
> The version I

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-10-31 Thread Li Gao
Yuqi,

Your error seems unrelated to the headless service config you need to enable.
For the headless service, you need to create one that matches your driver pod
name exactly in order for the Spark 2.4 RC to work in client mode. We have had
this running for a while now, using a Jupyter kernel as the driver client.

-Li


On Wed, Oct 31, 2018 at 7:30 AM Zhang, Yuqi  wrote:

> Hi Gourav,
>
>
>
> Thank you for your reply.
>
>
>
> I haven't tried Glue or EMR, but I guess it's integrating Kubernetes on AWS
> instances?
>
> I could set up the k8s cluster on AWS, but my problem is that I don't know
> how to run spark-shell on Kubernetes…
>
> Since Spark only supports client mode on k8s from version 2.4, which is not
> officially released yet, I would like to ask if there is more detailed
> documentation regarding how to run spark-shell on a k8s cluster.
>
>
>
> Thank you in advance & best regards!
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>
>
> *From: *Gourav Sengupta 
> *Date: *Wednesday, October 31, 2018 18:34
> *To: *"Zhang, Yuqi" 
> *Cc: *user , "Nogami, Masatsugu"
> 
> *Subject: *Re: [Spark Shell on AWS K8s Cluster]: Is there more
> documentation regarding how to run spark-shell on k8s cluster?
>
>
>
> [External Email]
> --
>
> Just out of curiosity why would you not use Glue (which is Spark on
> kubernetes) or EMR?
>
>
>
> Regards,
>
> Gourav Sengupta
>
>
>
> On Mon, Oct 29, 2018 at 1:29 AM Zhang, Yuqi 
> wrote:
>
> Hello guys,
>
>
>
> I am Yuqi from Teradata Tokyo. Sorry to disturb, but I have a problem with
> the Spark 2.4 client mode function on a Kubernetes cluster, so I would like
> to ask if there is a solution to my problem.
>
>
>
> The problem is that when I try to run spark-shell on a Kubernetes v1.11.3
> cluster in an AWS environment, I can't successfully run a stateful set using
> the Docker image built from Spark 2.4. The error message is shown below.
> The version I am using is Spark v2.4.0-rc3.
>
>
>
> Also, I wonder if there is more documentation on how to use client mode or
> integrate spark-shell on a Kubernetes cluster. The documentation at
> https://github.com/apache/spark/blob/v2.4.0-rc3/docs/running-on-kubernetes.md
> has only a brief description. I understand it's not the officially released
> version yet, but if there is some more documentation, could you please share
> it with me?
>
>
>
> Thank you very much for your help!
>
>
>
>
>
> Error msg:
>
> + env
>
> + sed 's/[^=]*=\(.*\)/\1/g'
>
> + sort -t_ -k4 -n
>
> + grep SPARK_JAVA_OPT_
>
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
>
> + '[' -n '' ']'
>
> + '[' -n '' ']'
>
> + PYSPARK_ARGS=
>
> + '[' -n '' ']'
>
> + R_ARGS=
>
> + '[' -n '' ']'
>
> + '[' '' == 2 ']'
>
> + '[' '' == 3 ']'
>
> + case "$SPARK_K8S_CMD" in
>
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client
> "$@")
>
> + exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf
> spark.driver.bindAddress= --deploy-mode client
>
> Error: Missing application resource.
>
> Usage: spark-submit [options]  [app
> arguments]
>
> Usage: spark-submit --kill [submission ID] --master [spark://...]
>
> Usage: spark-submit --status [submission ID] --master [spark://...]
>
> Usage: spark-submit run-example [options] example-class [example args]
>
>
>
>
>
> --
>
> Yuqi Zhang
>
> Software Engineer
>
> m: 090-6725-6573
>
>
>
> 2 Chome-2-23-1 Akasaka
>
> Minato, Tokyo 107-0052
> teradata.com 
>
> This e-mail is from Teradata Corporation and may contain information that
> is confidential or proprietary. If you are not the intended recipient, do
> not read, copy or distribute the e-mail or any attachments. Instead, please
> notify the sender and delete the e-mail and any attachments. Thank you.
>
> Please consider the environment before printing.
>
>
>
>
>
>


Re: External shuffle service on K8S

2018-10-26 Thread Li Gao
There is an existing 2.2-based external shuffle service on the fork:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html

You can modify it to suit your needs.

-Li


On Fri, Oct 26, 2018 at 3:22 AM vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> No, it's on the roadmap for after 2.4.
>
> Le ven. 26 oct. 2018 à 11:15, 曹礼俊  a écrit :
>
>> Hi all:
>>
>> Does Spark 2.3.2 support the external shuffle service on Kubernetes?
>>
>> I have looked up the documentation(
>> https://spark.apache.org/docs/latest/running-on-kubernetes.html), but
>> couldn't find related suggestions.
>>
>> If it is supported, how can I enable it?
>>
>> Best Regards
>>
>> Lijun Cao
>>
>>
>>
>


[K8S] Option to keep the executor pods after job finishes

2018-10-09 Thread Li Gao
Hi,

Is there an option to keep the executor pods on k8s after the job finishes?
We want to extract the logs and stats before removing the executor pods.

Thanks,
Li


Re: [K8S] Spark initContainer custom bootstrap support for Spark master

2018-08-16 Thread Li Gao
Thanks! We will likely use the second option to customize the bootstrap.
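
For reference, a minimal sketch of that second option: a custom image whose
entrypoint runs our bootstrap first and then hands off to the stock Spark
entrypoint (the script path and bootstrap step are illustrative).

    #!/usr/bin/env bash
    # custom-entrypoint.sh, baked into the image and set as its ENTRYPOINT
    set -e
    /opt/bootstrap/run-custom-scripts.sh    # hypothetical bootstrap step
    exec /opt/entrypoint.sh "$@"            # hand off to Spark's standard entrypoint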

On Thu, Aug 16, 2018 at 10:04 AM Yinan Li  wrote:

> Yes, the init-container has been removed in the master branch. The
> init-container was used in 2.3.x only for downloading remote dependencies,
> which is now handled by running spark-submit in the driver. If you need to
> run custom bootstrap scripts using an init-container, the best option would
> be to use a mutating admission webhook to inject your init-container into
> the Spark pods. Another option is to create a custom image that runs the
> scripts prior to entering the entrypoint.
>
> Yinan
>
> On Wed, Aug 15, 2018 at 9:12 AM Li Gao  wrote:
>
>> Hi,
>>
>> We've noticed that on the latest master (not the Spark 2.3.1 branch), the
>> support for the Kubernetes initContainer is no longer there. What would be
>> the path forward if we need to perform custom bootstrap actions (i.e., run
>> additional scripts) before the driver/executor container enters running
>> mode?
>>
>> Thanks,
>> Li
>>
>>


[K8S] Spark initContainer custom bootstrap support for Spark master

2018-08-15 Thread Li Gao
Hi,

We've noticed that on the latest master (not the Spark 2.3.1 branch), the
support for the Kubernetes initContainer is no longer there. What would be the
path forward if we need to perform custom bootstrap actions (i.e., run
additional scripts) before the driver/executor container enters running mode?

Thanks,
Li


Spark 2.4 release date

2018-06-18 Thread Li Gao
Hello,

Do we have an estimate of when Spark 2.4 will be GA?
We are evaluating whether to backport some of the 2.4 fixes into our 2.3
deployment.

Thank you.