> .filter(col("id0").bitwiseOR(col("id1")) % jobs == mod) \
> .withColumn("test", test_score_r4(col("id0"), col("id1"))) \
> .cache()
> df.count()
> df.coalesce(300).write.mode("overwrite").parquet(output_mod)
One guess: you are doing two things here, count() and write(). There is a
persist(), but it's async. It won't necessarily wait for the persist to
finish before proceeding, and the second op may have to recompute at least
some partitions. You could debug further by looking at the stages and
seeing what is actually recomputed.
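The guess above — a second action recomputing partitions that didn't make it into the cache in time — can be pictured with a small plain-Python toy. This is not Spark; all names here (`Partitioned`, `cache_upto`) are made up for illustration:

```python
# Toy sketch (plain Python, not Spark): if the second action runs before the
# cache is fully populated, the missing partitions are recomputed.
class Partitioned:
    def __init__(self, n):
        self.n = n
        self.cache = {}    # partition id -> data
        self.computed = 0  # total partition computations performed

    def _part(self, i):
        if i in self.cache:
            return self.cache[i]
        self.computed += 1
        return [i]         # stand-in for real work

    def count(self, cache_upto=None):
        # Simulate an "async" persist: only partitions below cache_upto
        # make it into the cache before the next action starts.
        total = 0
        for i in range(self.n):
            data = self._part(i)
            if cache_upto is None or i < cache_upto:
                self.cache[i] = data
            total += len(data)
        return total

    def write(self):
        return [self._part(i) for i in range(self.n)]

d = Partitioned(4)
d.count(cache_upto=2)  # only half the partitions get cached in time
d.write()              # the uncached half is recomputed
assert d.computed == 6  # 4 on the first pass + 2 recomputed
```

In this toy, checking `d.computed` plays the role of looking at the stages in the Spark UI to see how much work ran twice.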
Sent: Sunday, January 30, 2022 1:08 AM
To: sebastian@gmail.com
Cc: Benjamin Du; u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice

Hi,

Without getting into suppositions, the best option is to look into the Spark
UI SQL section. It is the most wonderful tool to explain what is actually
happening.
From: Deepak Sharma
Sent: Sunday, January 30, 2022 12:45 AM
To: Benjamin Du
Cc: u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice
coalesce returns a new dataset.
That will cause the recomputation.
Thanks
Deepak
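One way to picture the claim above — coalesce returns a new Dataset — is a plain-Python toy (not Spark; `DS`, `persist_count`, and the rest are hypothetical names): if the new plan object doesn't consult the parent's cache, every action on it recomputes the upstream work.

```python
# Toy sketch (plain Python, not Spark) of a transformation that returns a
# new plan object which bypasses the parent's cache.
calls = {"n": 0}

def expensive():
    calls["n"] += 1          # count how often the real work runs
    return list(range(10))

class DS:
    def __init__(self, compute):
        self.compute = compute
        self.cache = None

    def persist_count(self):
        # materialize and cache, then return the row count
        if self.cache is None:
            self.cache = self.compute()
        return len(self.cache)

    def coalesce(self):
        # a *new* dataset wrapping the same computation, with an empty cache
        return DS(self.compute)

df = DS(expensive)
df.persist_count()             # computes once and caches
df.coalesce().persist_count()  # new plan: recomputes
assert calls["n"] == 2
```

Whether real Spark reuses the cache across a given transformation is exactly the kind of thing the Spark UI SQL tab (mentioned elsewhere in this thread) can confirm for a concrete plan.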
On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote:

It works well if I manually write a DataFrame to disk, read it back,
repartition/coalesce it, and then write it back to HDFS.
df = spark.read.parquet("/input/hdfs/path") \
    .filter(col("n0") == n0) \
    .filter(col("n1") == n1) \
    .filter(col("h1") == h1) \
    .filter(...) \
    .withColumn("new_col", my_pandas_udf("col0", "col1")) \
    .persist(StorageLevel.DISK_ONLY)
df.count()
df.coalesce(300).write.mode("overwrite").parquet(output_mod)
BTW, it works well if I manually write the DataFrame to HDFS, read it
back, coalesce it and write it back to HDFS.
Originally posted at
https://stackoverflow.com/questions/70781494/a-persisted-spark-dataframe-is-computed-twice.
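The workaround described above — materialize the intermediate result to storage and read it back, severing the lineage — can be sketched in plain Python (the file path and `expensive` helper are illustrative, not from the thread):

```python
import json
import os
import tempfile

# Toy sketch: writing an intermediate result out and reading it back means
# downstream steps reread the stored data instead of recomputing it.
calls = {"n": 0}

def expensive():
    calls["n"] += 1
    return [x * x for x in range(5)]

tmp = os.path.join(tempfile.mkdtemp(), "intermediate.json")
with open(tmp, "w") as f:
    json.dump(expensive(), f)   # compute once, write out
with open(tmp) as f:
    data = json.load(f)         # read back: no recomputation

# "coalesce and write back" would operate on `data`, never on the lineage
assert calls["n"] == 1
assert data == [0, 1, 4, 9, 16]
```

In Spark terms, the analogous pattern is write-to-parquet followed by a fresh `spark.read.parquet`, which is exactly what the thread reports as working reliably.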
Best,
Ben Du
Personal Blog<http://www.legendu.net/> | GitHub