22 1:08 AM
To: sebastian@gmail.com
Cc: Benjamin Du ; u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice
Hi,
Without getting into suppositions, the best option is to look at the SQL section of the Spark UI.
It is the most wonderful tool for explaining what is actually happening.
From: Deepak Sharma
Sent: Sunday, January 30, 2022 12:45 AM
To: Benjamin Du
Cc: u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice
coalesce returns a new dataset.
That will cause the recomputation.
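(Illustrative snippet, not from the original message: assume df is the persisted DataFrame and 300 is just an example target partition count.)

coalesced = df.coalesce(300)   # coalesce returns a new DataFrame; df itself is unchanged
print(coalesced is df)         # False: the persisted object and the coalesced one are different objects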
Thanks
Deepak
On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote:
A workaround is to write the DataFrame to disk, read it back,
repartition/coalesce it, and then write it back to HDFS.
from pyspark.sql.functions import col

spark.read.parquet("/input/hdfs/path") \
    .filter(col("n0") == n0) \
    .filter(col("n1") == n1) \
    .filter(col("h1") == h1)
# the remaining filters were truncated in the original message
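For completeness, a hedged sketch of that workaround, assuming df is the filtered DataFrame built by the snippet above; the intermediate and output paths and the target partition count of 300 are illustrative only:

df.write.mode("overwrite").parquet("/tmp/hdfs/intermediate")   # illustrative temporary path

spark.read.parquet("/tmp/hdfs/intermediate") \
    .coalesce(300) \
    .write.mode("overwrite").parquet("/output/hdfs/path")      # illustrative final path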
I have some PySpark code like below. Basically, I persist a DataFrame (which is
time-consuming to compute) to disk, call DataFrame.count to trigger
the caching/persist immediately, and then coalesce the DataFrame to reduce
the number of partitions (the original DataFrame has 30,000 partitions).
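Reconstructed for illustration only (the original code was truncated): the pattern being described, assuming df is the expensive DataFrame and the output path and partition count are placeholders.

from pyspark import StorageLevel

df = df.persist(StorageLevel.DISK_ONLY)   # persist the expensive DataFrame to disk
df.count()                                # action that materialises the persisted data now
df.coalesce(300).write.mode("overwrite").parquet("/output/hdfs/path")

The behaviour being asked about in the thread is that the final write recomputes df instead of reusing the persisted copy.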
From: Sean Owen
Sent: Monday, October 4, 2021 1:00 PM
To: Benjamin Du
Cc: user@spark.apache.org
Subject: Re: [RNG]: How does Spark handle RNGs?
The 2nd approach. Spark doesn't work in the 1st way in any context - the driver
and executor processes do not cooperate during execution.
Operati…
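(An illustrative sketch of the 2nd approach, not from the original message: each partition seeds its own generator from a base seed plus its partition index, so no coordination with the driver is needed; spark is an existing SparkSession.)

import random

base_seed = 42  # arbitrary illustrative seed

def gen(partition_index, rows):
    rng = random.Random(base_seed + partition_index)  # independent RNG per partition
    for _ in rows:
        yield rng.random()

samples = spark.sparkContext.parallelize(range(1000), 8) \
    .mapPartitionsWithIndex(gen) \
    .collect()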
Hi everyone,
I'd like to ask how Spark (or, more generally, distributed computing
engines) handles RNGs. At a high level, there are two ways:
1. Use a single RNG on the driver; each worker that needs random numbers
makes a request to that single RNG on the driver.
2. Use a separate, independently seeded RNG on each worker.