Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du
22 1:08 AM To: sebastian@gmail.com Cc: Benjamin Du ; u...@spark.incubator.apache.org Subject: Re: A Persisted Spark DataFrame is computed twice Hi, without getting into suppositions, the best option is to look into the Spark UI SQL section. It is the most wonderful tool to explain wh

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du
Deepak Sharma Sent: Sunday, January 30, 2022 12:45 AM To: Benjamin Du Cc: u...@spark.incubator.apache.org Subject: Re: A Persisted Spark DataFrame is computed twice coalesce returns a new dataset. That will cause the recomputation. Thanks Deepak On Sun, 30 Jan 2022 at 14:06, Benjamin Du mailto

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Benjamin Du
DataFrame to disk, read it back, repartition/coalesce it, and then write it back to HDFS. spark.read.parquet("/input/hdfs/path") \ .filter(col("n0") == n0) \ .filter(col("n1") == n1) \ .filter(col("h1") == h1) \ .filter(col(

A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Benjamin Du
I have some PySpark code like below. Basically, I persist a DataFrame (which is time-consuming to compute) to disk, call the method DataFrame.count to trigger the caching/persist immediately, and then I coalesce the DataFrame to reduce the number of partitions (the original DataFrame has 30,000

Re: [RNG]: How does Spark handle RNGs?

2021-10-04 Thread Benjamin Du
From: Sean Owen Sent: Monday, October 4, 2021 1:00 PM To: Benjamin Du Cc: user@spark.apache.org Subject: Re: [RNG]: How does Spark handle RNGs? The 2nd approach. Spark doesn't work in the 1st way in any context - the driver and executor processes do not cooperate during execution. Operati
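The second approach described above can be simulated in plain Python: each partition seeds its own independent RNG stream, with no coordination through the driver. The base-seed-plus-partition-index scheme here is an illustration, not Spark's internal implementation:

```python
import random

# Simulate per-partition RNGs: each partition derives an independent seed,
# so tasks generate random numbers without contacting a central RNG.
BASE_SEED = 42

def draw_partition(partition_index, rows):
    rng = random.Random(BASE_SEED + partition_index)  # independent stream per partition
    return [(x, rng.random()) for x in rows]

# Two "partitions" of data, processed independently, as Spark tasks would be.
partitions = [range(0, 4), range(4, 8)]
result = [pair for i, part in enumerate(partitions)
          for pair in draw_partition(i, part)]
```

Because each seed is a deterministic function of the partition index, the output is reproducible across reruns, which is also why recomputing a lost partition regenerates the same numbers.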

[RNG]: How does Spark handle RNGs?

2021-10-04 Thread Benjamin Du
Hi everyone, I'd like to ask how Spark (or, more generally, distributed computing engines) handles RNGs. At a high level, there are two ways: 1. Use a single RNG on the driver, and random-number generation on each worker makes a request to the single RNG on the driver. 2. Use a
