Re: Handling skew in window functions

2021-04-27 Thread Mich Talebzadeh
Hi, let us go back and understand this behaviour. It sounds like your partitioning by (user_id, group_id) results in skewed data. We just had a similar skewed-data thread, titled "Tasks are skewed to one executor". Have a look at that one and see whether any of those suggestions like

Handling skew in window functions

2021-04-27 Thread Michael Doo
Hello! I have a data set that I'm trying to process in PySpark. The data (on disk as Parquet) contains user IDs, session IDs, and metadata related to each session. I'm adding a number of columns to my dataframe that are the result of aggregating over a window. The issue I'm running into is that

Re: Spark Streaming non functional requirements

2021-04-27 Thread Mich Talebzadeh
Forgot to add, under non-functional requirements, the heading *Supportability and Maintainability*. Someone queried the other day how to shut down a streaming job gracefully, meaning wait until the "current queue", including the backlog, is drained and all processing is completed.

Re: Spark Streaming non functional requirements

2021-04-27 Thread ashok34...@yahoo.com.INVALID
Hello Mich, Thank you for your great explanation. Best, A. On Tuesday, 27 April 2021, 11:25:19 BST, Mich Talebzadeh wrote: Hi, Any design (in whatever framework) needs to consider both Functional and non-functional requirements. Functional requirements are those which are related to

Re: Spark Streaming non functional requirements

2021-04-27 Thread Mich Talebzadeh
Hi, Any design (in whatever framework) needs to consider both Functional and non-functional requirements. Functional requirements are those which are related to the technical functionality of the system that we cover daily in this forum. The non-functional requirement is a requirement that

Re: How to calculate percentiles in Scala Spark 2.4.x

2021-04-27 Thread Sean Owen
Erm, just https://spark.apache.org/docs/2.3.0/api/sql/index.html#approx_percentile ? On Tue, Apr 27, 2021 at 3:52 AM Ivan Petrov wrote: > Hi, I have billions, potentially dozens of billions of observations. Each > observation is a decimal number. > I need to calculate percentiles 1, 25, 50, 75,

How to calculate percentiles in Scala Spark 2.4.x

2021-04-27 Thread Ivan Petrov
Hi, I have billions, potentially dozens of billions of observations. Each observation is a decimal number. I need to calculate percentiles 1, 25, 50, 75, 95 for these observations using Scala Spark. I can use both RDD and Dataset API. Whatever would work better. What I can do in terms of perf

Re: com.google.protobuf.Parser.parseFrom() method Can't use in spark

2021-04-27 Thread null
I tried it on Ubuntu (original Spark), and everything worked correctly. It looks like Amazon Spark or CDH Spark changed something in Spark that causes this error. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: [Spark in Kubernetes] Question about running in client mode

2021-04-27 Thread Shiqi Sun
Hi Attila, Ah that makes sense. Thanks for the clarification! Best, Shiqi On Mon, Apr 26, 2021 at 8:09 PM Attila Zsolt Piros < piros.attila.zs...@gmail.com> wrote: > Hi Shiqi, > > In case of client mode the driver runs locally: in the same machine, even > in the same process, of the spark