Regarding DS v2, I'd like to remove:

SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs

and instead add the umbrella ticket:

SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> data source V2 API refactoring

Thanks,
Wenchen

On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> What's the process to propose a feature to be included in the final Spark 3.0 release?

I don't know whether there exists any specific process here; normally you just merge the feature into Spark master before the release code freeze, and the feature will then probably be included in the release. The code freeze date for Spark 3.0 has not been decided yet, though.

On Tue, Oct 8, 2019 at 2:14 PM, Li Jin <ice.xell...@gmail.com> wrote:

Thanks for the summary!

I have a question that is semi-related: what's the process to propose a feature to be included in the final Spark 3.0 release?

In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.
On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

Thanks for all the feedback. Here is the updated feature list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Cogroup Pandas UDF missing:

SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463>
Support Dataframe Cogroup via Pandas UDFs

Vectorized R execution:

SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability

On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Thanks for bringing the nice summary of Spark 3.0 improvements!

I'd like to add some items from the Structured Streaming side:

SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers in Structured Streaming
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support (there were follow-up issues adding functionality such as multi-cluster support)
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn message on possible correctness issue for multiple stateful operations in single query

and from the core side:

SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)

FYI, if we count current work in progress, there's an ongoing umbrella issue regarding rolling event log & snapshot (SPARK-28594 <https://issues.apache.org/jira/browse/SPARK-28594>) that we are struggling to get done in Spark 3.0.
Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here I'm listing all the notable features and major changes that are ready to test/deliver; please don't hesitate to add more to the list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL (ongoing)
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc (ongoing)
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo
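[Editor's note: several items in the lists above concern the new Pandas UDF forms, notably SPARK-26412 (Pandas UDFs that take an iterator of pd.DataFrames). The motivation usually cited is that the iterator form lets expensive per-partition setup run once rather than once per batch. Below is a minimal, Spark-free sketch of that pattern, using plain Python lists as stand-in batches; the "model" and batch values are made up for illustration and a real UDF would consume pd.DataFrame batches inside a Spark job.]

```python
# Conceptual sketch of the iterator-of-batches pattern behind SPARK-26412.
# Plain-Python stand-ins are used instead of pd.DataFrame so the idea is
# runnable without Spark; the "model" below is hypothetical.
from typing import Iterator, List


def expensive_init() -> dict:
    # Stands in for loading a large model or other costly state
    # once per partition, rather than once per batch.
    return {"scale": 10}


def predict_batches(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    model = expensive_init()  # runs once, amortized over all batches
    for batch in batches:
        # Transform each batch lazily and yield the result batch.
        yield [x * model["scale"] for x in batch]


if __name__ == "__main__":
    partition = iter([[1, 2], [3]])
    for out in predict_batches(partition):
        print(out)
```

In the real API, a function with this iterator-in/iterator-out shape is registered as a Pandas UDF and Spark feeds it the Arrow batches of each partition.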