Thanks for the summary! I have a semi-related question: what's the process for proposing a feature to be included in the final Spark 3.0 release?
In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.

On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Hi all,
>
> Thanks for all the feedback; here is the updated feature list:
>
> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support
> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations (SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>)
> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0
>
> Cheers,
>
> Xingbo
>
> On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Cogroup Pandas UDF missing:
>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
>>
>> Vectorized R execution:
>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
>>
>> On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks for bringing the nice summary of Spark 3.0 improvements!
>>>
>>> I'd like to add some items from the structured streaming side:
>>>
>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers in Structured Streaming
>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support (there were follow-up issues to add functionality such as multi-cluster support)
>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn message on possible correctness issue for multiple stateful operations in single query
>>>
>>> and from the core side:
>>>
>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)
>>>
>>> FYI, counting current work in progress, there's an ongoing umbrella issue regarding rolling event log & snapshot (SPARK-28594 <https://issues.apache.org/jira/browse/SPARK-28594>) which we are struggling to get done in Spark 3.0.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here I'm listing all the notable features and major changes that are ready to test/deliver; please don't hesitate to add more to the list:
>>>>
>>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
>>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
>>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
>>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
>>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
>>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>> [...]
>>>>
>>>> Cheers,
>>>>
>>>> Xingbo
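
As an aside for readers skimming the list: SPARK-26412 (Pandas UDFs taking an iterator of batches) is easiest to grasp from a small sketch. The snippet below imitates the iterator-of-batches contract in plain pandas with no Spark dependency; the function name `plus_one` and the batch contents are made up for illustration, and in Spark 3.0 a function of this shape would be registered with the `pandas_udf` decorator.

```python
import pandas as pd
from typing import Iterator


def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Expensive one-time setup (e.g. loading a model) would go here;
    # with the iterator form it runs once per partition rather than
    # once per batch.
    state = 1
    for batch in batches:
        yield batch + state


# Simulate how Spark would feed the UDF a stream of Arrow-backed batches:
batches = iter([pd.Series([1, 2]), pd.Series([3])])
result = pd.concat(plus_one(batches), ignore_index=True)
print(result.tolist())  # [2, 3, 4]
```

The point of the iterator form is exactly that amortization of per-partition setup; the per-row semantics are unchanged from a scalar Pandas UDF.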