Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
As pointed out by Dongjoon, the second half of December is the holiday season in most countries. If we do the code freeze in mid-November and release the first RC in mid-December, I am afraid the community will not be active enough to verify the release candidates during the holiday season. Normally, the RC

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
> > Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0. I think we made a change in release cadence since Spark 2.3. See the commit: https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa Thus, Spark 3.1 might just follow the release cadence of Spark

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
For Xiao's comment, I want to point out that Apache Spark 3.1.0 is different from 2.3 or 2.4. Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0. - Apache Spark 2.0.0 was released on July 26, 2016. - Apache Spark 2.1.0 was released on December 28, 2016. Bests, Dongjoon. On Sun, Oct

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
Thank you all. BTW, Xiao and Mridul, I'm wondering what date you have in mind specifically. Usually, the `Christmas and New Year season` doesn't give us much additional time. If you think so, could you make a PR for the Apache Spark website according to your expectation?

Re: use java in Grouped Map pandas udf to avoid serDe

2020-10-04 Thread Lian Jiang
Please ignore this question. https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow shows pandas udf should have avoided jvm<->Python SerDe by maintaining one data copy in memory. spark.sql.execution.arrow.enabled is false by default. I think I missed

use java in Grouped Map pandas udf to avoid serDe

2020-10-04 Thread Lian Jiang
Hi, I am using the PySpark Grouped Map pandas UDF (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html). Functionality-wise it works great. However, SerDe causes a significant performance hit. To optimize this UDF, can I do either of the following: 1. use a java UDF to completely replace the python

Broadcast Variable question

2020-10-04 Thread Eduardo
Is there any concurrent access to broadcast variables in the workers? My use case is that the broadcast variable is a large dataset object which is *only* read. However, there is some cache in this object which is written (so it speeds up generation of entries in this dataset). Everything will
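
For illustration only, here is a minimal sketch of the situation described in the question: a read-only dataset object with an internal cache that several tasks may write at once. It assumes the tasks run as threads in a single process sharing one deserialized broadcast value, which is the case for JVM executors; PySpark workers are separate processes, so the lock may be unnecessary there. `CachedDataset`, `_generate`, and `read_concurrently` are hypothetical names, not Spark APIs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CachedDataset:
    """Hypothetical read-mostly dataset with a lazily filled internal cache."""

    def __init__(self, raw):
        self._raw = raw                # input data, only ever read
        self._cache = {}               # mutable state, written on first access
        self._lock = threading.Lock()  # guards _cache and the counter
        self.computations = 0          # how many entries were actually generated

    def _generate(self, key):
        # Stand-in for an expensive entry computation (assumption).
        return self._raw[key] * 2

    def get(self, key):
        # Holding the lock makes check-then-insert atomic across threads.
        with self._lock:
            if key not in self._cache:
                self._cache[key] = self._generate(key)
                self.computations += 1
            return self._cache[key]

    def __getstate__(self):
        # Locks cannot be pickled, so ship only the raw data when serialized.
        return {"_raw": self._raw}

    def __setstate__(self, state):
        # Each deserialized copy starts with an empty cache and a fresh lock.
        self._raw = state["_raw"]
        self._cache = {}
        self._lock = threading.Lock()
        self.computations = 0


def read_concurrently(dataset, keys, n_threads=8):
    # Simulates several tasks reading the same shared object in parallel.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(dataset.get, keys))
```

The single coarse lock keeps the sketch simple; if entry generation is slow, a per-key lock or an idempotent, lock-free fill might be a better fit.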

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Mridul Muralidharan
+1 on pushing the branch cut for increased dev time to match previous releases. Regards, Mridul On Sat, Oct 3, 2020 at 10:22 PM Xiao Li wrote: > Thank you for your updates. > > Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date of > the 3.1 branch cut, the feature

Excessive disk IO with Spark structured streaming

2020-10-04 Thread Sergey Oboguev
I am trying to run a Spark structured streaming program simulating a basic scenario of ingesting events and calculating aggregates on a window with a watermark, and I am observing Spark performing an inordinate amount of disk IO. The basic structure of the program is like this: sparkSession =