Re: [PSA] Sharing our Experiences With Kubernetes

2019-05-17 Thread Ramandeep Singh
Thanks for sharing. On Sat, May 18, 2019, 00:52 Matt Cheah wrote: Hi everyone, I would like to share the experiences my organization has had with deploying Kubernetes and migrating our Spark applications from YARN to Kubernetes. We are publishing a series of blog posts that

Fwd: Spark Architecture, Drivers, & Executors

2019-05-17 Thread Pat Ferrel
In order to create an application that executes code on Spark, we have a long-lived process. It periodically runs jobs programmatically on a Spark cluster, meaning it does not use spark-submit. The jobs it executes have varying memory requirements, so we want to have the Spark Driver run in the
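
A minimal sketch of one way to do this, using the org.apache.spark.launcher.SparkLauncher API (note that SparkLauncher still spawns spark-submit as a child process, while giving programmatic control). The jar path, main class, master URL, and memory value below are hypothetical placeholders, not details from this thread:

import org.apache.spark.launcher.SparkLauncher

// Launch a job from a long-lived JVM process with per-job resources.
// All resource names below are placeholders.
val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")         // hypothetical application jar
  .setMainClass("com.example.MyJob")          // hypothetical main class
  .setMaster("yarn")
  .setDeployMode("cluster")                   // driver runs on the cluster, not in this process
  .setConf(SparkLauncher.DRIVER_MEMORY, "4g") // vary this per job's needs
  .startApplication()

// handle.getState can be polled to track the job through its lifecycle.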

[PSA] Sharing our Experiences With Kubernetes

2019-05-17 Thread Matt Cheah
Hi everyone, I would like to share the experiences my organization has had with deploying Kubernetes and migrating our Spark applications from YARN to Kubernetes. We are publishing a series of blog posts that describe what we have learned and what we have built. Our introduction

Re: Access to live data of cached dataFrame

2019-05-17 Thread Sean Owen
A cached DataFrame isn't supposed to change, by definition. You can re-read each time or consider setting up a streaming source on the table which provides a result that updates as new data comes in. On Fri, May 17, 2019 at 1:44 PM Tomas Bartalos wrote: Hello, I have a cached dataframe:
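
A minimal sketch of both suggestions, assuming a Delta table at /data as in the question below; the console sink and output mode are illustrative choices, not from the thread:

import org.apache.spark.sql.functions.col

// Option 1: skip .cache so every action re-reads the table's current state.
val live = spark.read.format("delta").load("/data")
  .groupBy(col("event_hour")).count()
live.show() // reflects the data as of this read

// Option 2: a streaming source over the same table; the aggregation
// updates as new data arrives. Sink and output mode are illustrative.
val query = spark.readStream.format("delta").load("/data")
  .groupBy(col("event_hour")).count()
  .writeStream
  .outputMode("complete") // re-emit the full counts on each update
  .format("console")
  .start()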

Access to live data of cached dataFrame

2019-05-17 Thread Tomas Bartalos
Hello, I have a cached dataframe: spark.read.format("delta").load("/data").groupBy(col("event_hour")).count.cache I would like to access the "live" data for this data frame without dropping the cache (i.e., without calling unpersist()). Whatever I do, I always get the cached data on subsequent queries. Even

Out Of Memory while reading a table partition from HIVE

2019-05-17 Thread Shivam Sharma
Hi All, I am getting an OutOfMemoryError due to GC overhead while reading a table from Hive in Spark, like: spark.sql("SELECT * FROM some.table WHERE date='2019-05-14' LIMIT 10").show() When I run the above command in spark-shell, it starts processing *1780 tasks* and goes OOM at a
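
Not a confirmed fix for this thread, but a common mitigation sketch: project only the columns you need instead of SELECT * over a wide table, and let show() fetch just the rows it displays, so each task deserializes far less data. The column names below are hypothetical:

// Hypothetical column names; narrow the projection instead of SELECT *.
val df = spark.sql("SELECT event_id, event_time FROM some.table WHERE date = '2019-05-14'")
df.show(10) // show() only fetches the rows it prints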

Re: Spark job gets hung on cloudera cluster

2019-05-17 Thread Rishi Shah
Yes, that's exactly what happens, but I would think that a data node being unavailable (or its data being unavailable) should not cause an indefinite wait. Are there any properties we can set to avoid an indefinite/non-deterministic outcome of a Spark application? On Thu, May 16,
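
There is no single property that guarantees progress, but here is a sketch of Spark 2.x settings that bound waiting and retries; the values are illustrative, not recommendations from this thread:

import org.apache.spark.sql.SparkSession

// Illustrative Spark 2.x settings; values are placeholders.
val spark = SparkSession.builder()
  .config("spark.network.timeout", "300s")   // cap how long RPC/shuffle connections wait
  .config("spark.task.maxFailures", "4")     // fail the stage after repeated task failures
  .config("spark.blacklist.enabled", "true") // stop scheduling on nodes that keep failing
  .config("spark.speculation", "true")       // re-launch stragglers on other executors
  .getOrCreate()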

design question related to kafka.

2019-05-17 Thread Shyam P
Hi, https://stackoverflow.com/questions/56181135/design-can-kafka-producer-written-as-spark-job Thank you, Shyam
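
Judging only from the URL, the question asks whether a Kafka producer can be written as a Spark job. A minimal sketch using Spark's built-in Kafka sink (batch writes to Kafka are supported from Spark 2.4); the broker address, topic, and data below are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-producer-job").getOrCreate()
import spark.implicits._

// A batch Spark job acting as a Kafka producer via the built-in sink
// (requires the spark-sql-kafka-0-10 package). Everything below is a placeholder.
Seq(("k1", "v1"), ("k2", "v2")).toDF("key", "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "events")
  .save()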