Hi,
Let us go back and understand this behaviour.
Sounds like your partitioning with (user_id, group_id) results in skewed
data.
We just had a similar skewed-data issue/thread, titled
"Tasks are skewed to one executor".
Have a look at that one and see whether any of those suggestions like
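One common mitigation for this kind of skew is key salting: append a random suffix to the partition key so a hot (user_id, group_id) pair spreads across several partitions, aggregate the salted partials, then strip the salt and merge. A minimal plain-Python sketch of the two-stage idea (the helper names and the salt factor are illustrative, not from the thread):

```python
import random
from collections import defaultdict

NUM_SALTS = 8  # illustrative fan-out factor


def salted_key(user_id, group_id):
    """Append a random salt so one hot (user_id, group_id) pair
    spreads over NUM_SALTS partial keys instead of one partition."""
    return (user_id, group_id, random.randrange(NUM_SALTS))


def aggregate_with_salting(rows):
    """rows: iterable of (user_id, group_id, value) tuples.
    Stage 1: partial sums per salted key (the stage Spark runs in parallel).
    Stage 2: strip the salt and merge the partials."""
    partial = defaultdict(int)
    for user_id, group_id, value in rows:
        partial[salted_key(user_id, group_id)] += value
    final = defaultdict(int)
    for (user_id, group_id, _salt), subtotal in partial.items():
        final[(user_id, group_id)] += subtotal
    return dict(final)


rows = [("u1", "g1", 1)] * 1000 + [("u2", "g2", 5)]
totals = aggregate_with_salting(rows)
# totals[("u1", "g1")] == 1000 regardless of how the salt spread the rows
```

In Spark the same trick is usually applied by concatenating a random integer column onto the key before the first groupBy, then grouping again on the original key.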
Hello!
I have a data set that I'm trying to process in PySpark. The data (on disk
as Parquet) contains user IDs, session IDs, and metadata related to each
session. I'm adding a number of columns to my dataframe that are the result
of aggregating over a window. The issue I'm running into is that
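The kind of windowed column described here — for example, a count of sessions per user attached to every row — can be sketched without Spark like this (the column names and the count aggregation are assumptions for illustration; in PySpark this would be `F.count("session_id").over(Window.partitionBy("user_id"))`):

```python
from collections import Counter

# rows of (user_id, session_id, metadata); the window is
# partitioned by user_id
rows = [
    ("u1", "s1", "a"),
    ("u1", "s2", "b"),
    ("u2", "s3", "c"),
]

# One pass to build the per-partition aggregate...
sessions_per_user = Counter(user_id for user_id, _, _ in rows)

# ...then broadcast it back onto every row as a new column.
augmented = [
    (user_id, session_id, meta, sessions_per_user[user_id])
    for user_id, session_id, meta in rows
]
# augmented[0] == ("u1", "s1", "a", 2)
```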
Forgot to add, under non-functional requirements, the heading
- *Supportability and Maintainability*
Someone asked the other day how to shut down a streaming job
gracefully, meaning to wait until the "current queue", including the
backlog, is drained and all processing is completed.
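The drain-then-stop pattern can be sketched with a plain queue and a stop flag (this is an illustration of the general pattern, not Spark API; in Structured Streaming one would typically poll an external marker and call `query.stop()` only once the backlog is processed):

```python
import queue
import threading


def graceful_worker(q, stop_requested):
    """Keep processing after a stop is requested, and only return
    once the queue, including the backlog, is empty."""
    processed = []
    while True:
        if stop_requested.is_set() and q.empty():
            break  # backlog drained: safe to stop now
        try:
            item = q.get(timeout=0.05)
        except queue.Empty:
            continue
        processed.append(item)  # stand-in for real batch processing
    return processed


q = queue.Queue()
for i in range(5):
    q.put(i)

stop_requested = threading.Event()
stop_requested.set()  # request shutdown while 5 items are still queued
done = graceful_worker(q, stop_requested)
# all 5 queued items are still processed before the worker returns
```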
Hello Mich
Thank you for your great explanation.
Best
A.
On Tuesday, 27 April 2021, 11:25:19 BST, Mich Talebzadeh
wrote:
Hi,
Any design (in whatever framework) needs to consider both Functional and
non-functional requirements. Functional requirements are those which are
related to
Hi,
Any design (in whatever framework) needs to consider both Functional and
non-functional requirements. Functional requirements are those which are
related to the technical functionality of the system that we cover daily in
this forum. The non-functional requirement is a requirement that
Erm, just
https://spark.apache.org/docs/2.3.0/api/sql/index.html#approx_percentile ?
On Tue, Apr 27, 2021 at 3:52 AM Ivan Petrov wrote:
> Hi, I have billions, potentially dozens of billions of observations. Each
> observation is a decimal number.
> I need to calculate percentiles 1, 25, 50, 75,
Hi, I have billions, potentially dozens of billions, of observations. Each
observation is a decimal number.
I need to calculate percentiles 1, 25, 50, 75, 95 for these observations
using Scala Spark. I can use either the RDD or the Dataset API, whichever
works better.
What I can do in terms of perf
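For scale like this, the `approx_percentile` function linked above is the usual answer, since it uses a bounded-memory sketch rather than a full sort. The exact quantity it approximates is the nearest-rank percentile, which can be illustrated in a few lines of plain Python (toy data; the helper name is ours):

```python
def nearest_rank_percentile(values, p):
    """Exact nearest-rank percentile. approx_percentile trades this
    exactness for constant memory on billions of rows."""
    ordered = sorted(values)
    # nearest-rank definition: ceil(p/100 * n), as a 1-based rank
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


observations = list(range(1, 101))  # toy stand-in for billions of decimals
result = {p: nearest_rank_percentile(observations, p)
          for p in (1, 25, 50, 75, 95)}
# result == {1: 1, 25: 25, 50: 50, 75: 75, 95: 95}
```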
I tried it on Ubuntu (original Spark), and everything worked correctly
there! It looks like Amazon's Spark or CDH's Spark changed something in
Spark which causes this error.
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
-
Hi Attila,
Ah that makes sense. Thanks for the clarification!
Best,
Shiqi
On Mon, Apr 26, 2021 at 8:09 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:
> Hi Shiqi,
>
> In case of client mode the driver runs locally: in the same machine, even
> in the same process, of the spark
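The client/cluster distinction Attila describes maps onto the `--deploy-mode` flag of `spark-submit` (master URL and application path below are placeholders):

```shell
# client mode: the driver runs locally, in the submitting process
spark-submit --master yarn --deploy-mode client my_app.py

# cluster mode: the driver is launched on a node inside the cluster
spark-submit --master yarn --deploy-mode cluster my_app.py
```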