Looks like we have resolved all known outstanding issues so far. I will start another RC next Monday PST.
On Thu, Feb 4, 2021 at 12:03 AM, Kent Yao <yaooq...@qq.com> wrote:

> Sending https://github.com/apache/spark/pull/31460
>
> Based on my research so far, when there is an existing
> *io.file.buffer.size* in hive-site.xml, the hadoopConf finally gets reset
> by that.
> In many real-world cases, when interacting with the Hive catalog through
> Spark SQL, users may just share the hive-site.xml from their Hive jobs and
> make a copy to SPARK_HOME/conf w/o modification. In Spark, when we
> generate Hadoop configurations, we will use *spark.buffer.size* (65536) to
> reset *io.file.buffer.size* (4096). But when we load the hive-site.xml, we
> may ignore this behavior and reset *io.file.buffer.size* again according
> to hive-site.xml.
>
> The PR fixes:
> 1. The configuration priority for setting Hadoop and Hive configs here is
> not right; the order should be *spark > spark.hive >
> spark.hadoop > hive > hadoop*.
> 2. This breaks the *spark.buffer.size* config's behavior for tuning the IO
> performance w/ HDFS if there is an existing io.file.buffer.size in
> hive-site.xml.
>
> *Kent Yao*
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi <https://github.com/yaooqinn/kyuubi> is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark <http://spark.apache.org/>.*
> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer> A Spark
> SQL extension which provides SQL Standard Authorization for Apache
> Spark <http://spark.apache.org/>.*
> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras> A
> library that brings excellent and useful functions from various modern
> database management systems to Apache Spark <http://spark.apache.org/>.*
>
> On 02/3/2021 15:36, Maxim Gekk <maxim.g...@databricks.com> wrote:
>
> Hi All,
>
> Also I am investigating a performance regression in some TPC-DS queries
> (q88 for instance) that is caused by a recent commit in 3.1 ...
>
> I have found that the perf regression is caused by the Hadoop config:
> io.file.buffer.size = 4096
> Before the commit
> https://github.com/apache/spark/commit/278f6f45f46ccafc7a31007d51ab9cb720c9cb14,
> we had:
> io.file.buffer.size = 65536
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
> On Wed, Feb 3, 2021 at 2:37 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Yeah, agreed. I changed it. Thanks for the heads up, Tom.
>>
>> On Wed, Feb 3, 2021 at 8:31 AM, Tom Graves <tgraves...@yahoo.com> wrote:
>>
>>> OK, thanks for the update. That is marked as an improvement; if it's a
>>> blocker, can we mark it as such and describe why? I searched JIRAs and
>>> didn't see any critical or blockers open.
>>>
>>> Tom
>>> On Tuesday, February 2, 2021, 05:12:24 PM CST, Hyukjin Kwon <
>>> gurwls...@gmail.com> wrote:
>>>
>>> There is one here: https://github.com/apache/spark/pull/31440.
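As background to Maxim's finding above (io.file.buffer.size dropping from 65536 to 4096): the buffer size roughly determines how many underlying read calls a file scan needs, so a 16x smaller buffer means roughly 16x more reads. A back-of-the-envelope sketch, not Spark or HDFS code; the 256 MiB scan size is a made-up example:

```python
import math

# Back-of-the-envelope: reading the same amount of data with a smaller
# I/O buffer requires proportionally more underlying read calls.
data_size = 256 * 1024 * 1024  # hypothetical 256 MiB scan

for buf in (4096, 65536):
    reads = math.ceil(data_size / buf)
    print(f"io.file.buffer.size={buf}: ~{reads} read calls")
```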
>>> There look to be several issues identified (to confirm that this is an
>>> issue in OSS too) and being fixed in parallel.
>>> There were some unexpected delays here as several more issues were
>>> found. I will try to file and share relevant JIRAs as soon as I can confirm.
>>>
>>> On Wed, Feb 3, 2021 at 2:36 AM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>
>>> Just curious if we have an update on the next RC? Is there a JIRA for
>>> the TPC-DS issue?
>>>
>>> Thanks,
>>> Tom
>>>
>>> On Wednesday, January 27, 2021, 05:46:27 PM CST, Hyukjin Kwon <
>>> gurwls...@gmail.com> wrote:
>>>
>>> Just to share the current status, most of the known issues were
>>> resolved. Let me know if there are more.
>>> One thing left is a performance regression in TPC-DS being investigated.
>>> Once this is identified (and fixed if it should be), I will cut another RC
>>> right away.
>>> I roughly expect to cut another RC next Monday.
>>>
>>> Thanks, guys.
>>>
>>> On Wed, Jan 27, 2021 at 5:26 AM, Terry Kim <yumin...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Please check if the following regression should be included:
>>> https://github.com/apache/spark/pull/31352
>>>
>>> Thanks,
>>> Terry
>>>
>>> On Tue, Jan 26, 2021 at 7:54 AM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>> If we're OK waiting for it, I'd like to get
>>> https://github.com/apache/spark/pull/31298 in as well (it's not a
>>> regression, but it is a bug fix).
>>>
>>> On Tue, Jan 26, 2021 at 6:38 AM Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>> It looks like a cool one, but it's a pretty big one and affects the plans
>>> considerably ... maybe it's best to avoid adding it to 3.1.1, in
>>> particular during the RC period, if this isn't a clear regression that
>>> affects many users.
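The configuration precedence Kent Yao describes earlier in the thread (*spark > spark.hive > spark.hadoop > hive > hadoop*) can be sketched as a layered merge where higher-priority sources are applied last. This is a rough illustration only, not Spark's actual implementation; the `effective_conf` helper and the literal values are hypothetical:

```python
# Sketch of the intended precedence: spark > spark.hive > spark.hadoop
# > hive > hadoop. We merge lowest-priority sources first, so that each
# later (higher-priority) layer overrides earlier ones.

def effective_conf(hadoop, hive, spark_hadoop, spark_hive, spark):
    layers = [hadoop, hive, spark_hadoop, spark_hive, spark]
    merged = {}
    for layer in layers:  # later layers win
        merged.update(layer)
    return merged

# Example: hive-site.xml sets io.file.buffer.size, but the value derived
# from Spark's own spark.buffer.size should take precedence.
conf = effective_conf(
    hadoop={"io.file.buffer.size": "4096"},     # Hadoop default
    hive={"io.file.buffer.size": "8192"},       # from hive-site.xml
    spark_hadoop={},
    spark_hive={},
    spark={"io.file.buffer.size": "65536"},     # from spark.buffer.size
)
print(conf["io.file.buffer.size"])  # -> 65536
```

The bug described in the PR amounts to hive-site.xml being applied after the Spark-derived layer, i.e. the layers being merged in the wrong order.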
>>>
>>> On Tue, Jan 26, 2021 at 11:23 PM, Peter Toth <peter.t...@gmail.com> wrote:
>>>
>>> Hey,
>>>
>>> Sorry for chiming in a bit late, but I would like to suggest my PR (
>>> https://github.com/apache/spark/pull/28885) for review and inclusion
>>> in 3.1.1.
>>>
>>> Currently, invalid reuse reference nodes appear in many queries, causing
>>> performance issues and incorrect explain plans. Now that
>>> https://github.com/apache/spark/pull/31243 has been merged, these invalid
>>> references can easily be found in many of our golden files on master:
>>> https://github.com/apache/spark/pull/28885#issuecomment-767530441.
>>> But the issue isn't master (3.2) specific; it has actually been there
>>> since 3.0, when Dynamic Partition Pruning was added.
>>> So it is not a regression from 3.0 to 3.1.1, but in some cases (like
>>> TPC-DS q23b) it causes a performance regression from 2.4 to 3.x.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Tue, Jan 26, 2021 at 6:30 AM Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>> Guys, I plan to make an RC as soon as we have no visible issues. I have
>>> merged a few correctness fixes. These remain:
>>> - https://github.com/apache/spark/pull/31319 waiting for a review (I
>>> will do it soon too).
>>> - https://github.com/apache/spark/pull/31336
>>> - I know Max is investigating the perf regression, which hopefully
>>> will be fixed soon.
>>>
>>> Are there any more blockers or correctness issues? Please ping me or
>>> call them out here.
>>> I would like to avoid making an RC when there are clearly some issues to
>>> be fixed.
>>> If you're investigating something suspicious, that's fine too. It's
>>> better to make sure we're safe than to rush an RC without finishing
>>> the investigation.
>>>
>>> Thanks, all.
>>>
>>> On Fri, Jan 22, 2021 at 6:19 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>> Sure, thanks guys. I'll start another RC after the fixes. Looks like
>>> we're almost there.
>>>
>>> On Fri, 22 Jan 2021, 17:47 Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>> BTW, there is a correctness bug being fixed at
>>> https://github.com/apache/spark/pull/30788 . It's not a regression, but
>>> the fix is very simple and it would be better to start the next RC after
>>> merging that fix.
>>>
>>> On Fri, Jan 22, 2021 at 3:54 PM Maxim Gekk <maxim.g...@databricks.com>
>>> wrote:
>>>
>>> Also, I am investigating a performance regression in some TPC-DS queries
>>> (q88 for instance) that is caused by a recent commit in 3.1, most likely
>>> in the period from November 19, 2020 to December 18, 2020.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>> On Fri, Jan 22, 2021 at 10:45 AM Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>> -1, as I just found a regression in 3.1. A self-join query works well in
>>> 3.0 but fails in 3.1. It's being fixed at
>>> https://github.com/apache/spark/pull/31287
>>>
>>> On Fri, Jan 22, 2021 at 4:34 AM Tom Graves <tgraves...@yahoo.com.invalid>
>>> wrote:
>>>
>>> +1
>>>
>>> Built from the tarball; verified the SHA, and regular CI and tests all
>>> pass.
>>>
>>> Tom
>>>
>>> On Monday, January 18, 2021, 06:06:42 AM CST, Hyukjin Kwon <
>>> gurwls...@gmail.com> wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.1.1.
>>>
>>> The vote is open until January 22nd 4 PM PST and passes if a majority of
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.1.1-rc1 (commit
>>> 53fe365edb948d0e05a5ccb62f349cd9fcb4bb5d):
>>> https://github.com/apache/spark/tree/v3.1.1-rc1
>>>
>>> The release files, including signatures, digests, etc.,
>>> can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1364
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-docs/
>>>
>>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>>> https://s.apache.org/41kf2
>>>
>>> This release is using the release script of the tag v3.1.1-rc1.
>>>
>>> FAQ
>>>
>>> ===================
>>> What happened to 3.1.0?
>>> ===================
>>>
>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>> it was discussed and decided to skip 3.1.0.
>>> Please see
>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html
>>> for more details.
>>>
>>> =========================
>>> How can I help test this release?
>>> =========================
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env and install
>>> the current RC via "pip install
>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/pyspark-3.1.1.tar.gz
>>> "
>>> and see if anything important breaks.
>>> In Java/Scala, you can add the staging repository to your project's
>>> resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===========================================
>>> What should happen to JIRA tickets still targeting 3.1.1?
>>> ===========================================
>>>
>>> The current list of open tickets targeted at 3.1.1 can be found at
>>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>>> Version/s" = 3.1.1.
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else, please retarget to an
>>> appropriate release.
>>>
>>> ==================
>>> But my bug isn't fixed?
>>> ==================
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted, please ping me or a committer to
>>> help target the issue.
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau