Regarding DS v2, I'd like to remove:

SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs

and instead add the umbrella ticket:

SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> data source V2 API refactoring

Thanks,
Wenchen

On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:

Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> What's the process to propose a feature to be included in the final Spark 3.0 release?

I don't know whether there exists any specific process here; normally you just merge the feature into Spark master before the release code freeze, and the feature will then probably be included in the release. The code freeze date for Spark 3.0 has not been decided yet, though.

On Tue, Oct 8, 2019 at 2:14 PM, Li Jin <ice.xell...@gmail.com> wrote:

Thanks for the summary!

I have a question that is semi-related: what's the process to propose a feature to be included in the final Spark 3.0 release?

In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.
On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

Thanks for all the feedback. Here is the updated feature list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Cogroup Pandas UDF missing:

SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463>
Support Dataframe Cogroup via Pandas UDFs

Vectorized R execution:

SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability

On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

Thanks for bringing the nice summary of Spark 3.0 improvements!

I'd like to add some items from the Structured Streaming side:

SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers in Structured Streaming
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support (there were follow-up issues adding functionality such as multi-cluster support)
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn message on possible correctness issue for multiple stateful operations in single query

and from the core side:

SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)

FYI, if we count current work in progress, there's an ongoing umbrella issue regarding rolling event log & snapshot (SPARK-28594 <https://issues.apache.org/jira/browse/SPARK-28594>) that we are struggling to get done in Spark 3.0.
Thanks,
Jungtaek Lim (HeartSaVioR)

On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

Hi all,

I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here I'm listing all the notable features and major changes that are ready to test/deliver; please don't hesitate to add more to the list:

SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL (ongoing)
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc (ongoing)
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo
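[Editor's note: several items in the lists above concern the new Pandas UDF forms, notably SPARK-26412 (Pandas UDFs that take an iterator of pd.DataFrames). The motivation usually cited is that the iterator form lets expensive per-partition setup run once rather than once per batch. Below is a minimal, Spark-free sketch of that pattern, using plain Python lists as stand-in batches; the "model" and batch values are made up for illustration and a real UDF would consume pd.DataFrame batches inside a Spark job.]

```python
# Conceptual sketch of the iterator-of-batches pattern behind SPARK-26412.
# Plain-Python stand-ins are used instead of pd.DataFrame so the idea is
# runnable without Spark; the "model" below is hypothetical.
from typing import Iterator, List


def expensive_init() -> dict:
    # Stands in for loading a large model or other costly state
    # once per partition, rather than once per batch.
    return {"scale": 10}


def predict_batches(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    model = expensive_init()  # runs once, amortized over all batches
    for batch in batches:
        # Transform each batch lazily and yield the result batch.
        yield [x * model["scale"] for x in batch]


if __name__ == "__main__":
    partition = iter([[1, 2], [3]])
    for out in predict_batches(partition):
        print(out)
```

In the real API, a function with this iterator-in/iterator-out shape is registered as a Pandas UDF and Spark feeds it the Arrow batches of each partition.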