Thank you for preparing 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

>> What's the process to propose a feature to be included in the final Spark
>> 3.0 release?
>>
>
> I don't know whether there is any specific process here; normally you
> just merge the feature into Spark master before the release code freeze, and
> then the feature would probably be included in the release. The code freeze
> date for Spark 3.0 has not been decided yet, though.
>
> On Tue, Oct 8, 2019 at 2:14 PM Li Jin <ice.xell...@gmail.com> wrote:
>
>> Thanks for the summary!
>>
>> I have a question that is semi-related - What's the process to propose a
>> feature to be included in the final Spark 3.0 release?
>>
>> In particular, I am interested in
>> https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the
>> work, so I want to make sure I don't miss the "cut" date.
>>
>> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <jiangxb1...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks for all the feedback, here is the updated feature list:
>>>
>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>> Implement Dynamic Partition Pruning
>>>
>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
>>> Tree-Based Feature Transformation
>>>
>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>>> MultilabelClassificationEvaluator
>>>
>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>>> sample weights to decision trees
>>>
>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
>>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>
>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>>> for Power Iteration Clustering
>>>
>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
>>> logic for timing out executors in dynamic allocation
>>>
>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>
>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
>>> new executors to avoid hang because of blacklisting
>>>
>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796>
>>> Multiple columns support added to various Transformers: PySpark
>>> QuantileDiscretizer
>>>
>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>>> approach to do adaptive execution in Spark SQL
>>>
>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply
>>> custom log URL pattern for executor log URLs in SHS
>>>
>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>>> support for Kafka headers
>>>
>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>>> Spark ML Listener for Tracking ML Pipeline Status
>>>
>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
>>> the built-in Hive to 2.3.5 for hadoop-3.2
>>>
>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
>>> with validation set to Gradient Boosted Trees: Python API
>>>
>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>>> and Run Spark on JDK11
>>>
>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>>> Accelerator-aware task scheduling for Spark
>>>
>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>>> sharing Netty's memory pool allocators
>>>
>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix
>>> race condition with tasks running when new attempt for same stage is
>>> created leads to other task in the next
>>> attempt running on the same
>>> partition id retry multiple times
>>>
>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
>>> rolling back a shuffle map stage and re-generate the shuffle files
>>>
>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>>> source for binary files
>>>
>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>>> kafka delegation token support
>>>
>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>>> Generalize Nested Column Pruning
>>>
>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
>>> support for Scala 2.11 in Spark 3.0.0
>>>
>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
>>> reserved keywords after SQL standard
>>>
>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
>>> Pandas UDF to take an iterator of pd.DataFrames
>>>
>>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
>>> optimization in SparkR's interoperability
>>>
>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>>> source v2 API refactor: streaming write
>>>
>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>>
>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
>>> streaming output mode from data source v2 APIs
>>>
>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
>>> StreamingWrite at the beginning of streaming execution
>>>
>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
>>> infer schema when reading Hive serde table with native data source
>>>
>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>>> Implement join strategy hints
>>>
>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>
>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>>> deadlock between TaskMemoryManager and
>>> UnsafeExternalSorter$SpillableIterator
>>>
>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public
>>> APIs for extended Columnar Processing Support
>>>
>>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
>>> Dataframe Cogroup via Pandas UDFs
>>>
>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>>> Re-implement file sources with data source V2 API
>>>
>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>>> Dynamic Allocation
>>>
>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>>> Partially push down disjunctive predicated in Parquet/ORC
>>>
>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>>> test cases from PostgreSQL to Spark SQL
>>>
>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>>> Deprecate Python 2 support
>>>
>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
>>> applicable *.sql tests into UDF integrated test base
>>>
>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
>>> dynamic allocation without an external shuffle service
>>>
>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust
>>> post shuffle partition number in adaptive execution
>>>
>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>>> Trigger implementations to Triggers.scala and avoid exposing these to the
>>> end users
>>>
>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372>
>>> Document Spark WEB UI
>>>
>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>>> RobustScaler feature transformer
>>>
>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426>
>>> Metadata Handling in Thrift Server
>>>
>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a
>>> SQL reference doc
>>>
>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
>>> test coverage of ThriftServer
>>>
>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>>> Dynamically reuse subqueries in AQE
>>>
>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
>>> outdated Experimental, Evolving annotations
>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove
>>> deprecated items since <= 2.2.0
>>>
>>> Cheers,
>>>
>>> Xingbo
>>>
>>> On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> Cogroup Pandas UDF missing:
>>>>
>>>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463>
>>>> Support Dataframe Cogroup via Pandas UDFs
>>>> Vectorized R execution:
>>>>
>>>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
>>>> optimization in SparkR's interoperability
>>>>
>>>>
>>>> On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Thanks for bringing the nice summary of Spark 3.0 improvements!
>>>>>
>>>>> I'd like to add some items from the Structured Streaming side,
>>>>>
>>>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>>>>> Trigger implementations to Triggers.scala and avoid exposing these to the
>>>>> end users (removal of deprecated)
>>>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>>>>> support for Kafka headers in Structured Streaming
>>>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>>>>> kafka delegation token support (there were follow-up issues to add
>>>>> functionalities like support multi clusters, etc.)
>>>>>
>>>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>>>>> Introduce new option to Kafka source: offset by timestamp
>>>>> (starting/ending)
>>>>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log
>>>>> warn message on possible correctness issue for multiple stateful operations
>>>>> in single query
>>>>>
>>>>> and from the core side,
>>>>>
>>>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
>>>>> feature: apply custom log URL pattern for executor log URLs in SHS
>>>>> (follow-up issue expanded the functionality to Spark UI as well)
>>>>>
>>>>> FYI, if we count current work in progress, there's an ongoing umbrella
>>>>> issue regarding rolling event log & snapshot (SPARK-28594
>>>>> <https://issues.apache.org/jira/browse/SPARK-28594>), which we are
>>>>> struggling to get done for Spark 3.0.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>>
>>>>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0.
>>>>>> Here I'm listing all the notable features and major changes that are
>>>>>> ready to test/deliver; please don't hesitate to add more to the list:
>>>>>>
>>>>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>>>>> Multiple columns support added to various Transformers: StringIndexer
>>>>>>
>>>>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>>>>> Implement Dynamic Partition Pruning
>>>>>>
>>>>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677>
>>>>>> Support Tree-Based Feature Transformation
>>>>>>
>>>>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>>>>>> MultilabelClassificationEvaluator
>>>>>>
>>>>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>>>>>> sample weights to decision trees
>>>>>>
>>>>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712>
>>>>>> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window,
>>>>>> Union etc.
>>>>>>
>>>>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R
>>>>>> API for Power Iteration Clustering
>>>>>>
>>>>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286>
>>>>>> Improve logic for timing out executors in dynamic allocation
>>>>>>
>>>>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>>>>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>>>>
>>>>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148>
>>>>>> Acquire new executors to avoid hang because of blacklisting
>>>>>>
>>>>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796>
>>>>>> Multiple columns support added to various Transformers: PySpark
>>>>>> QuantileDiscretizer
>>>>>>
>>>>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A
>>>>>> new approach to do adaptive execution in Spark SQL
>>>>>>
>>>>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>>>>>> Spark ML Listener for Tracking ML Pipeline Status
>>>>>>
>>>>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710>
>>>>>> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>>>
>>>>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add
>>>>>> fit with validation set to Gradient Boosted Trees: Python API
>>>>>>
>>>>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417>
>>>>>> Build and Run Spark on JDK11
>>>>>>
>>>>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>>>>>> Accelerator-aware task scheduling for Spark
>>>>>>
>>>>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920>
>>>>>> Allow sharing Netty's memory pool allocators
>>>>>>
>>>>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix
>>>>>> race condition with tasks running when new attempt for same stage is
>>>>>> created leads to other task in the next attempt running on the
>>>>>> same partition id retry multiple times
>>>>>>
>>>>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341>
>>>>>> Support rolling back a shuffle map stage and re-generate the shuffle
>>>>>> files
>>>>>>
>>>>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>>>>>> source for binary files
>>>>>>
>>>>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>>>>>> Generalize Nested Column Pruning
>>>>>>
>>>>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132>
>>>>>> Remove support for Scala 2.11 in Spark 3.0.0
>>>>>>
>>>>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215>
>>>>>> define reserved keywords after SQL standard
>>>>>>
>>>>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412>
>>>>>> Allow Pandas UDF to take an iterator of pd.DataFrames
>>>>>>
>>>>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>>>>>> source v2 API refactor: streaming write
>>>>>>
>>>>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956>
>>>>>> remove streaming output mode from data source v2 APIs
>>>>>>
>>>>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064>
>>>>>> create StreamingWrite at the beginning of streaming execution
>>>>>>
>>>>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do
>>>>>> not infer schema when reading Hive serde table with native data source
>>>>>>
>>>>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>>>>>> Implement join strategy hints
>>>>>>
>>>>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>>>>>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>>>>
>>>>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>>>>>> deadlock between TaskMemoryManager and
>>>>>> UnsafeExternalSorter$SpillableIterator
>>>>>>
>>>>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396>
>>>>>> Public APIs for extended Columnar Processing Support
>>>>>>
>>>>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>>>>>> Re-implement file sources with data source V2 API
>>>>>>
>>>>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>>>>>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>>>>>> Dynamic Allocation
>>>>>>
>>>>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>>>>>> Partially push down disjunctive predicated in Parquet/ORC
>>>>>>
>>>>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>>>>>> test cases from PostgreSQL to Spark SQL (ongoing)
>>>>>>
>>>>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>>>>>> Deprecate Python 2 support
>>>>>>
>>>>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921>
>>>>>> Convert applicable *.sql tests into UDF integrated test base
>>>>>>
>>>>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963>
>>>>>> Allow dynamic allocation without an external shuffle service
>>>>>>
>>>>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177>
>>>>>> Adjust post shuffle partition number in adaptive execution
>>>>>>
>>>>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372>
>>>>>> Document Spark WEB UI
>>>>>>
>>>>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>>>>>> RobustScaler feature transformer
>>>>>>
>>>>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426>
>>>>>> Metadata Handling in Thrift Server
>>>>>>
>>>>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588>
>>>>>> Build a SQL reference doc (ongoing)
>>>>>>
>>>>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608>
>>>>>> Improve test coverage of ThriftServer
>>>>>>
>>>>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>>>>>> Dynamically reuse subqueries in AQE
>>>>>>
>>>>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855>
>>>>>> Remove outdated Experimental, Evolving annotations
>>>>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>>>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980>
>>>>>> Remove deprecated items since <= 2.2.0
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xingbo