subject:"A Persisted Spark DataFrame is computed twice"

Re: A Persisted Spark DataFrame is computed twice

2022-02-01 Thread Gourav Sengupta

(col("id0").bitwiseOR(col("id1")) % jobs == mod) \
>     .withColumn("test", test_score_r4(col("id0"), col("id1"))) \
>     .cache()
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Sean Owen

One guess - you are doing two things here, count() and write(). There is a persist(), but it's async. It won't necessarily wait for the persist to finish before proceeding and may have to recompute at least some partitions for the second op. You could debug further by looking at the stages and

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Sebastian Piu

---
> *From:* Deepak Sharma
> *Sent:* Sunday, January 30, 2022 12:45 AM
> *To:* Benjamin Du
> *Cc:* u...@spark.incubator.apache.org
> *Subject:* Re: A Persisted Spark DataFrame is computed twice
>
> coalesce returns a new dataset.
> That will cause the recomputation.

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du

22 1:08 AM
To: sebastian@gmail.com
Cc: Benjamin Du; u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice

Hi, without getting into suppositions, the best option is to look into the SPARK UI SQL section. It is the most wonderful tool to explain wh

Re: A Persisted Spark DataFrame is computed twice

2022-01-31 Thread Benjamin Du

pak Sharma
Sent: Sunday, January 30, 2022 12:45 AM
To: Benjamin Du
Cc: u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice

coalesce returns a new dataset.
That will cause the recomputation.

Thanks
Deepak

On Sun, 30 Jan 2022 at 14:06, Benjamin Du mailto

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Benjamin Du

ataFrame to disk, read it back, repartition/coalesce it, and then write it back to HDFS.

spark.read.parquet("/input/hdfs/path") \
    .filter(col("n0") == n0) \
    .filter(col("n1") == n1) \
    .filter(col("h1") == h1) \
    .filter(col(

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Gourav Sengupta

n("new_col", my_pandas_udf("col0", "col1")) \
>>     .persist(StorageLevel.DISK_ONLY)
>> df.count()
>> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>>
>>
>> BTW, it works well if I manually write the DataFra

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Deepak Sharma

NLY)
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
>
>
> BTW, it works well if I manually write the DataFrame to HDFS, read it
> back, coalesce it and write it back to HDFS.
> Originally posted at
> https://stackoverflow.com/questions/70

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Sebastian Piu

e DataFrame to HDFS, read it
> back, coalesce it and write it back to HDFS.
> Originally posted at
> https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.
>
>

A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Benjamin Du

write it back to HDFS.
Originally posted at https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.

Best,
Ben Du
Personal Blog <http://www.legendu.net/> | GitHub
