SPARK-29345 <https://issues.apache.org/jira/browse/SPARK-29345> Add an API that allows a user to define and observe arbitrary metrics on streaming queries

Let us add this too.

Cheers,
Xiao

On Tue, Oct 8, 2019 at 10:31 PM Wenchen Fan <cloud0...@gmail.com> wrote:

Regarding DS v2, I'd like to remove

SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs

and put the umbrella ticket there instead:

SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> data source V2 API refactoring

Thanks,
Wenchen

On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> What's the process to propose a feature to be included in the final Spark 3.0 release?

I don't know of any specific process here; normally you just merge the feature into Spark master before the release code freeze, and the feature will then likely be included in the release. The code freeze date for Spark 3.0 has not been decided yet, though.

On Tue, Oct 8, 2019 at 2:14 PM Li Jin <ice.xell...@gmail.com> wrote:

Thanks for the summary!

I have a semi-related question: what's the process to propose a feature to be included in the final Spark 3.0 release?

In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.
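To make the SPARK-29345 proposal above concrete (named, user-defined metrics observed as data flows through a streaming query, without disturbing the data itself), here is a pure-Python sketch of the pass-through shape. This is not Spark's actual API; the `observe` generator and its signature are invented for illustration.

```python
# Hypothetical sketch of the idea behind SPARK-29345: attach named aggregate
# metrics to a stream of batches and record their per-batch values, while the
# data itself passes through unchanged. NOT Spark's API; `observe` and its
# arguments are invented for illustration only.

def observe(batches, collected, **metrics):
    """Yield each batch unchanged, appending one {metric: value} dict
    per batch to `collected`."""
    for batch in batches:
        collected.append({name: fn(batch) for name, fn in metrics.items()})
        yield batch

collected = []
stream = observe([[1, 2, 3], [4, 5]], collected, row_count=len, total=sum)

# Downstream processing still sees every batch untouched.
result = [sum(batch) for batch in stream]
```

In Spark itself the metrics would be arbitrary aggregate expressions reported through the streaming query progress, but the pass-through structure is the same.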
On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

Thanks for all the feedback; here is the updated feature list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add Kafka delegation token support
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations (SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>)
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
Cogroup Pandas UDF missing:

SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs

Vectorized R execution:

SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability

On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Thanks for bringing the nice summary of Spark 3.0 improvements!

I'd like to add some items from the structured streaming side:

SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers in Structured Streaming
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add Kafka delegation token support (there were follow-up issues adding functionality such as multi-cluster support, etc.)
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn message on possible correctness issue for multiple stateful operations in single query

and from the core side:

SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)

FYI, if we count current work in progress, there's an ongoing umbrella issue regarding rolling event log & snapshot (SPARK-28594 <https://issues.apache.org/jira/browse/SPARK-28594>) which we are struggling to get done in time for Spark 3.0.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

I went over all the finished JIRA tickets targeted at Spark 3.0.0; here I'm listing all the notable features and major changes that are ready to test/deliver. Please don't hesitate to add more to the list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL (ongoing)
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc (ongoing)
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations (SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>)
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo
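Several of the listed features are Python-facing; for instance, SPARK-26412 lets a Pandas UDF consume an iterator of batches so that expensive setup can run once per task rather than once per batch. The control flow can be sketched in plain Python (lists stand in for pd.Series; this is an illustration of the pattern, not Spark's `pandas_udf` API, and no Spark or pandas is needed to run it):

```python
# Sketch of the batch-iterator pattern enabled by SPARK-26412: the UDF body
# receives an iterator of batches, so one-time setup (e.g. loading a model)
# happens once per task and each batch is then transformed lazily. Plain
# lists stand in for pd.Series; names here are illustrative only.

def add_offset(batches):
    offset = 100  # stand-in for expensive per-task setup, done only once
    for batch in batches:
        yield [value + offset for value in batch]

out = list(add_offset(iter([[1, 2], [3]])))
```

In Spark the same shape is expressed as a `pandas_udf` whose function takes and yields an iterator of `pd.Series`, with Spark supplying the batches.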