Re: Structured Streaming on Kubernetes

Krishna Kalyan Mon, 16 Apr 2018 01:52:09 -0700

Thank you so much TD, Matt, Anirudh and Oz,
Really appreciate this.

On Fri, Apr 13, 2018 at 9:54 PM, Oz Ben-Ami <[email protected]> wrote:


> I can confirm that Structured Streaming works on Kubernetes, though we're
> not quite in production with that yet. Issues we're looking at are:
> - Submission through spark-submit works, but is a bit clunky with a
> kubernetes-centered workflow. Spark Operator
> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator> is
> promising, but still in alpha (e.g., we ran into this
> <https://github.com/kubernetes/kubernetes/issues/56018>). Even better
> would be something that runs the driver as a Deployment / StatefulSet, so
> that long-running streaming jobs can be restarted automatically
> - Dynamic allocation: works with the spark-on-k8s fork, but not with plain
> Spark 2.3, due to its reliance on the shuffle service, which hasn't been merged yet.
> Ideal implementation would be able to connect to a PersistentVolume
> independently of a node, but that's a bit more complicated
> - Checkpointing: We checkpoint to a separate HDFS (Dataproc) cluster,
> which works well for us both on the old Spark Streaming and Structured
> Streaming. We've successfully experimented with HDFS on Kubernetes
> <https://github.com/apache-spark-on-k8s/kubernetes-HDFS/tree/master>, but
> again not in production
> - UI: Unfortunately Structured Streaming does not yet have a comprehensive
> UI like the old Spark Streaming, but it does show the basic information
> (jobs, stages, queries, executors), and other information is generally
> available in the logs and metrics
> - Monitoring / Logging: this is a strength of Kubernetes, in that it's all
> centralized by the cluster. We use Splunk, but it would also be possible to
> hook up <https://github.com/dhatim/dropwizard-prometheus> Spark's Dropwizard
> Metrics library to Prometheus, and read logs with fluentd or Stackdriver.
> - Side note: Kafka support in Spark and Structured Streaming is very good,
> but as of Spark 2.3 there are still a couple of missing features, notably
> transparent Avro support (UDFs are needed) and taking advantage of
> transactional processing (introduced to Kafka last year) for better
> exactly-once guarantees
>
> On Fri, Apr 13, 2018 at 3:08 PM, Anirudh Ramanathan <
> [email protected]> wrote:
>
>> +ozzieba who was experimenting with streaming workloads recently. +1 to
>> what Matt said. Checkpointing and driver recovery are future work.
>> Structured streaming is important, and it would be good to get some
>> production experiences here and try and target improving the feature's
>> support on K8s for 2.4/3.0.
>>
>>
>> On Fri, Apr 13, 2018 at 11:55 AM Matt Cheah <[email protected]> wrote:
>>
>>> We don’t provide any Kubernetes-specific mechanisms for streaming, such
>>> as checkpointing to persistent volumes. But as long as streaming doesn’t
>>> require persisting to the executor’s local disk, streaming ought to work
>>> out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s local
>>> directories.
>>>
>>>
>>>
>>> However, I’m unaware of any specific use of streaming with the Spark on
>>> Kubernetes integration right now. Would be curious to get feedback on the
>>> failover behavior right now.
>>>
>>>
>>>
>>> -Matt Cheah
>>>
>>>
>>>
>>> *From: *Tathagata Das <[email protected]>
>>> *Date: *Friday, April 13, 2018 at 1:27 AM
>>> *To: *Krishna Kalyan <[email protected]>
>>> *Cc: *user <[email protected]>
>>> *Subject: *Re: Structured Streaming on Kubernetes
>>>
>>>
>>>
>>> Structured Streaming is stable in production! At Databricks, we and our
>>> customers collectively process hundreds of billions of records per day
>>> using SS. However, we are not using Kubernetes :)
>>>
>>>
>>>
>>> Though I don't think it will matter too much as long as kubes are
>>> correctly provisioned+configured and you are checkpointing to HDFS (for
>>> fault-tolerance guarantees).
>>>
>>>
>>>
>>> TD
>>>
>>>
>>>
>>> On Fri, Apr 13, 2018, 12:28 AM Krishna Kalyan <[email protected]>
>>> wrote:
>>>
>>> Hello All,
>>>
>>> We were evaluating Spark Structured Streaming on Kubernetes (running on
>>> GCP). It would be awesome if the Spark community could share their
>>> experience around this. I would like to know more about your production
>>> experience and the monitoring tools you are using.
>>>
>>>
>>>
>>> Since Spark on Kubernetes is a relatively new addition to Spark, I was
>>> wondering if Structured Streaming is stable in production. We were also
>>> evaluating Apache Beam with Flink.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Krishna
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Anirudh Ramanathan
>>
>
>
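
A minimal sketch of two points from the thread, in case it helps anyone
reading the archive: submission goes through spark-submit against the
Kubernetes API server, and the checkpoint location should live on something
like HDFS rather than on the pod's local disk. The helper names, image name,
class name, jar path and host names below are all hypothetical. Only the
spark-submit flags and the two configuration keys come from the Spark 2.3
documentation. Passing the checkpoint path as the last application argument
is an assumption about how the streaming job is written.

```python
from urllib.parse import urlparse

# Schemes treated as pod-local in this sketch (an assumption, not a Spark rule).
LOCAL_SCHEMES = {"", "file"}


def is_remote_checkpoint(location):
    """Return True if the location does not point at the pod's local disk."""
    return urlparse(location).scheme.lower() not in LOCAL_SCHEMES


def build_submit_args(api_server, image, main_class, app_jar, checkpoint,
                      executors=2):
    """Build a spark-submit argument list for Kubernetes cluster mode."""
    if not is_remote_checkpoint(checkpoint):
        raise ValueError("checkpoint must not be on pod-local disk: " + checkpoint)
    return [
        "spark-submit",
        "--master", "k8s://" + api_server,
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", "spark.executor.instances=%d" % executors,
        "--conf", "spark.kubernetes.container.image=" + image,
        app_jar,
        checkpoint,  # hypothetical: the job reads its checkpoint path from argv
    ]


if __name__ == "__main__":
    # All values here are placeholders for illustration only.
    args = build_submit_args(
        api_server="https://k8s.example.com:6443",
        image="example/spark:2.3.0",
        main_class="com.example.StreamingJob",
        app_jar="local:///opt/spark/jars/streaming-job.jar",
        checkpoint="hdfs://namenode:8020/checkpoints/streaming-job",
    )
    print(" ".join(args))
```

The check only looks at the URI scheme, so it catches a bare path or a
file:// URI but says nothing about whether the remote store is reachable or
suitable for checkpointing. Inside the job itself, the same path would
normally be passed as the checkpointLocation option on the streaming query.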
