Hi all,

Thanks for all the feedback. Here is the updated feature list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Cogroup Pandas UDF missing:
>
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
>
> Vectorized R execution:
>
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
>
>
> On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks for bringing the nice summary of Spark 3.0 improvements!
>>
>> I'd like to add some items from structured streaming side,
>>
>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers in Structured Streaming
>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support (there were follow-up issues to add functionalities like support multi clusters, etc.)
>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn message on possible correctness issue for multiple stateful operations in single query
>>
>> and core side,
>>
>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New feature: apply custom log URL pattern for executor log URLs in SHS (follow-up issue expanded the functionality to Spark UI as well)
>>
>> FYI if we count on current work in progress, there's ongoing umbrella issue regarding rolling event log & snapshot (SPARK-28594 <https://issues.apache.org/jira/browse/SPARK-28594>) which we struggle to get things done in Spark 3.0.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0, here I'm listing all the notable features and major changes that are ready to test/deliver, please don't hesitate to add more to the list:
>>>
>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
>>>
>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
>>>
>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
>>>
>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
>>>
>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>
>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
>>>
>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
>>>
>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
>>>
>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
>>>
>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
>>>
>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
>>>
>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
>>>
>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>
>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
>>>
>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
>>>
>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
>>>
>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
>>>
>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
>>>
>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
>>>
>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
>>>
>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
>>>
>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
>>>
>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
>>>
>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
>>>
>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
>>>
>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
>>>
>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
>>>
>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
>>>
>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
>>>
>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>
>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
>>>
>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
>>>
>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
>>>
>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
>>>
>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
>>>
>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL (ongoing)
>>>
>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
>>>
>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
>>>
>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
>>>
>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
>>>
>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
>>>
>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
>>>
>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
>>>
>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc (ongoing)
>>>
>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
>>>
>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
>>>
>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0
>>>
>>> Cheers,
>>>
>>> Xingbo