As pointed out by Dongjoon, the 2nd half of December is the holiday season
in most countries. If we do the code freeze in mid-November and release the
first RC in mid-December, I am afraid the community will not be active enough
to verify the release candidates during the holiday season. Normally, the RC
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
I think we changed the release cadence starting with Spark 2.3. See this
commit:
https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
Thus, Spark 3.1 might just follow the release cadence of Spark
Regarding Xiao's comment, I want to point out that Apache Spark 3.1.0 is
different from 2.3 or 2.4.
Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
- Apache Spark 2.0.0 was released on July 26, 2016.
- Apache Spark 2.1.0 was released on December 28, 2016.
Bests,
Dongjoon.
On Sun, Oct
Thank you all.
BTW, Xiao and Mridul, I'm wondering what date you have in mind,
specifically.
Usually, the `Christmas and New Year season` doesn't give us much additional
time.
If you think so, could you make a PR for the Apache Spark website according
to your expectation?
Please ignore this question.
https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow
shows that pandas UDFs should avoid JVM<->Python SerDe by maintaining a
single copy of the data in memory. `spark.sql.execution.arrow.enabled` is false by default.
I think I missed
Hi,
I am using a PySpark Grouped Map pandas UDF (
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html).
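For reference, a minimal sketch of a Grouped Map pandas UDF of this kind,
essentially the subtract-mean example from that docs page (the column names
and the logic are only illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
# This flag governs Arrow use for toPandas()/createDataFrame(pandas_df);
# pandas UDFs exchange data through Arrow regardless of its value.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ["id", "v"])

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf holds all rows of one group as a pandas DataFrame
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()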
Functionality-wise it works great. However, SerDe causes a lot of performance
hits. To optimize this UDF, can I do either of the below:
1. use a Java UDF to completely replace the Python
Is there any concurrent access to broadcast variables in the workers? My
use case is that the broadcast variable is a large dataset object which is
*only* read. However, there is a cache inside this object that is written to
(so it speeds up generation of entries in this dataset). Everything will
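A minimal PySpark sketch of that pattern, with the broadcast payload kept
read-only and a memo cache written only on the worker side (the names and the
entry-generation logic here are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Large read-only payload; plain Python structures pickle cleanly for broadcast.
bc = sc.broadcast({"rows": list(range(100_000))})

def process_partition(keys):
    # Memo cache scoped to one partition: written while the partition is
    # processed, never shipped back to the driver.
    cache = {}
    for k in keys:
        if k not in cache:
            cache[k] = sum(bc.value["rows"]) * k  # stand-in for expensive generation
        yield k, cache[k]

result = sc.parallelize([1, 2, 1, 3, 2, 1], 2).mapPartitions(process_partition).collect()

On the JVM side, tasks in the same executor generally share one cached copy of
a broadcast value, so a cache kept inside the broadcast object itself can be
accessed concurrently and needs to be thread-safe; in PySpark, each worker
process deserializes its own copy, so cache writes stay local to that process.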
+1 on pushing back the branch cut for increased dev time, to match previous
releases.
Regards,
Mridul
On Sat, Oct 3, 2020 at 10:22 PM Xiao Li wrote:
> Thank you for your updates.
>
> Spark 3.0 was released on Jun 18, 2020. If Nov 1st is the target date of
> the 3.1 branch cut, the feature
I am trying to run a Spark structured streaming program simulating a basic
scenario of ingesting events and calculating aggregates on a window with a
watermark, and I am observing an inordinate amount of disk IO that Spark
performs.
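For reference, a watermarked windowed aggregation of this shape might look
like the following minimal sketch (the rate source, column names, and
checkpoint path are assumptions, not the original program):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("windowed-agg-sketch").getOrCreate()

# Hypothetical event source; the built-in rate source stands in for the real ingest.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 100)
          .load()
          .withColumnRenamed("timestamp", "eventTime"))

# 5-minute tumbling windows with a 10-minute watermark on event time.
counts = (events
          .withWatermark("eventTime", "10 minutes")
          .groupBy(window(col("eventTime"), "5 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/agg-checkpoint")  # hypothetical path
         .start())
query.awaitTermination()

Note that aggregation state and the streaming offset/commit logs are persisted
under the checkpoint location on every micro-batch, so a certain amount of
disk IO is inherent to this pattern.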
The basic structure of the program is like this:
sparkSession =