Re: Spark 3.0 preview release feature list and major changes

Dongjoon Hyun Tue, 08 Oct 2019 22:20:09 -0700

Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.


On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang <[email protected]> wrote:

>> What's the process to propose a feature to be included in the final Spark
>> 3.0 release?
>>
>
> I don't know whether there is a specific process. Normally you just merge
> the feature into Spark master before the release code freeze, and then the
> feature will probably be included in the release. The code freeze date for
> Spark 3.0 has not been decided yet, though.
>
> On Tue, Oct 8, 2019 at 2:14 PM Li Jin <[email protected]> wrote:
>
>> Thanks for the summary!
>>
>> I have a question that is semi-related - What's the process to propose a
>> feature to be included in the final Spark 3.0 release?
>>
>> In particular, I am interested in
>> https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the
>> work, so I want to make sure I don't miss the "cut" date.
>>
>> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks for all the feedback. Here is the updated feature list:
>>>
>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>> Implement Dynamic Partition Pruning
>>>
>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
>>> Tree-Based Feature Transformation
>>>
>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>>> MultilabelClassificationEvaluator
>>>
>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>>> sample weights to decision trees
>>>
>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
>>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>
>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>>> for Power Iteration Clustering
>>>
>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
>>> logic for timing out executors in dynamic allocation
>>>
>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>
>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
>>> new executors to avoid hang because of blacklisting
>>>
>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796>
>>> Multiple columns support added to various Transformers: PySpark
>>> QuantileDiscretizer
>>>
>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>>> approach to do adaptive execution in Spark SQL
>>>
>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply
>>> custom log URL pattern for executor log URLs in SHS
>>>
>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>>> support for Kafka headers
>>>
>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>>> Spark ML Listener for Tracking ML Pipeline Status
>>>
>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
>>> the built-in Hive to 2.3.5 for hadoop-3.2
>>>
>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
>>> with validation set to Gradient Boosted Trees: Python API
>>>
>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>>> and Run Spark on JDK11
>>>
>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>>> Accelerator-aware task scheduling for Spark
>>>
>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>>> sharing Netty's memory pool allocators
>>>
>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix
>>> race condition with tasks running when new attempt for same stage is
>>> created leads to other task in the next attempt running on the same
>>> partition id retry multiple times
>>>
>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
>>> rolling back a shuffle map stage and re-generate the shuffle files
>>>
>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>>> source for binary files
>>>
>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>>> kafka delegation token support
>>>
>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>>> Generalize Nested Column Pruning
>>>
>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
>>> support for Scala 2.11 in Spark 3.0.0
>>>
>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
>>> reserved keywords after SQL standard
>>>
>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
>>> Pandas UDF to take an iterator of pd.DataFrames
>>>
>>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
>>> optimization in SparkR's interoperability
>>>
>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>>> source v2 API refactor: streaming write
>>>
>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>>
>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
>>> streaming output mode from data source v2 APIs
>>>
>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
>>> StreamingWrite at the beginning of streaming execution
>>>
>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
>>> infer schema when reading Hive serde table with native data source
>>>
>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>>> Implement join strategy hints
>>>
>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>
>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>>> deadlock between TaskMemoryManager and
>>> UnsafeExternalSorter$SpillableIterator
>>>
>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public
>>> APIs for extended Columnar Processing Support
>>>
>>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
>>> Dataframe Cogroup via Pandas UDFs
>>>
>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>>> Re-implement file sources with data source V2 API
>>>
>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>>> Dynamic Allocation
>>>
>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>>> Partially push down disjunctive predicated in Parquet/ORC
>>>
>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>>> test cases from PostgreSQL to Spark SQL
>>>
>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>>> Deprecate Python 2 support
>>>
>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
>>> applicable *.sql tests into UDF integrated test base
>>>
>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
>>> dynamic allocation without an external shuffle service
>>>
>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust
>>> post shuffle partition number in adaptive execution
>>>
>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>>> Trigger implementations to Triggers.scala and avoid exposing these to the
>>> end users
>>>
>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372>
>>> Document Spark WEB UI
>>>
>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>>> RobustScaler feature transformer
>>>
>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426>
>>> Metadata Handling in Thrift Server
>>>
>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a
>>> SQL reference doc
>>>
>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
>>> test coverage of ThriftServer
>>>
>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>>> Dynamically reuse subqueries in AQE
>>>
>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
>>> outdated Experimental, Evolving annotations
>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove
>>> deprecated items since <= 2.2.0
>>>
>>> Cheers,
>>>
>>> Xingbo
>>>
>>> On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <[email protected]> wrote:
>>>
>>>> The cogroup Pandas UDF is missing:
>>>>
>>>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463>
>>>> Support Dataframe Cogroup via Pandas UDFs
>>>>
>>>> Vectorized R execution:
>>>>
>>>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
>>>> optimization in SparkR's interoperability
>>>>
>>>>
>>>> On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <[email protected]> wrote:
>>>>
>>>>> Thanks for putting together this nice summary of Spark 3.0 improvements!
>>>>>
>>>>> I'd like to add some items from the Structured Streaming side:
>>>>>
>>>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>>>>> Trigger implementations to Triggers.scala and avoid exposing these to the
>>>>> end users (removal of deprecated)
>>>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>>>>> support for Kafka headers in Structured Streaming
>>>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>>>>> kafka delegation token support (there were follow-up issues to add
>>>>> functionality such as multi-cluster support, etc.)
>>>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>>>>> Introduce new option to Kafka source: offset by timestamp
>>>>> (starting/ending)
>>>>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log
>>>>> warn message on possible correctness issue for multiple stateful
>>>>> operations in single query
>>>>>
>>>>> and from the core side:
>>>>>
>>>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
>>>>> feature: apply custom log URL pattern for executor log URLs in SHS
>>>>> (follow-up issue expanded the functionality to Spark UI as well)
>>>>>
>>>>> FYI, if we count current work in progress, there's an ongoing umbrella
>>>>> issue regarding rolling event log & snapshot (SPARK-28594
>>>>> <https://issues.apache.org/jira/browse/SPARK-28594>), which we are
>>>>> struggling to get done for Spark 3.0.
>>>>>
>>>>> Thanks,
>>>>> Jungtaek Lim (HeartSaVioR)
>>>>>
>>>>>
>>>>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0.
>>>>>> Here I'm listing all the notable features and major changes that are
>>>>>> ready to test/deliver. Please don't hesitate to add more to the list:
>>>>>>
>>>>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>>>>> Multiple columns support added to various Transformers: StringIndexer
>>>>>>
>>>>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>>>>> Implement Dynamic Partition Pruning
>>>>>>
>>>>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677>
>>>>>> Support Tree-Based Feature Transformation
>>>>>>
>>>>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>>>>>> MultilabelClassificationEvaluator
>>>>>>
>>>>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>>>>>> sample weights to decision trees
>>>>>>
>>>>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712>
>>>>>> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window,
>>>>>> Union etc.
>>>>>>
>>>>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R
>>>>>> API for Power Iteration Clustering
>>>>>>
>>>>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286>
>>>>>> Improve logic for timing out executors in dynamic allocation
>>>>>>
>>>>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>>>>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>>>>
>>>>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148>
>>>>>> Acquire new executors to avoid hang because of blacklisting
>>>>>>
>>>>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796>
>>>>>> Multiple columns support added to various Transformers: PySpark
>>>>>> QuantileDiscretizer
>>>>>>
>>>>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A
>>>>>> new approach to do adaptive execution in Spark SQL
>>>>>>
>>>>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>>>>>> Spark ML Listener for Tracking ML Pipeline Status
>>>>>>
>>>>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710>
>>>>>> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>>>
>>>>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add
>>>>>> fit with validation set to Gradient Boosted Trees: Python API
>>>>>>
>>>>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417>
>>>>>> Build and Run Spark on JDK11
>>>>>>
>>>>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>>>>>> Accelerator-aware task scheduling for Spark
>>>>>>
>>>>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920>
>>>>>> Allow sharing Netty's memory pool allocators
>>>>>>
>>>>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix
>>>>>> race condition with tasks running when new attempt for same stage is
>>>>>> created leads to other task in the next attempt running on the same
>>>>>> partition id retry multiple times
>>>>>>
>>>>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341>
>>>>>> Support rolling back a shuffle map stage and re-generate the shuffle
>>>>>> files
>>>>>>
>>>>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>>>>>> source for binary files
>>>>>>
>>>>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>>>>>> Generalize Nested Column Pruning
>>>>>>
>>>>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132>
>>>>>> Remove support for Scala 2.11 in Spark 3.0.0
>>>>>>
>>>>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215>
>>>>>> define reserved keywords after SQL standard
>>>>>>
>>>>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412>
>>>>>> Allow Pandas UDF to take an iterator of pd.DataFrames
>>>>>>
>>>>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>>>>>> source v2 API refactor: streaming write
>>>>>>
>>>>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956>
>>>>>> remove streaming output mode from data source v2 APIs
>>>>>>
>>>>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064>
>>>>>> create StreamingWrite at the beginning of streaming execution
>>>>>>
>>>>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do
>>>>>> not infer schema when reading Hive serde table with native data source
>>>>>>
>>>>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>>>>>> Implement join strategy hints
>>>>>>
>>>>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>>>>>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>>>>
>>>>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>>>>>> deadlock between TaskMemoryManager and
>>>>>> UnsafeExternalSorter$SpillableIterator
>>>>>>
>>>>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396>
>>>>>> Public APIs for extended Columnar Processing Support
>>>>>>
>>>>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>>>>>> Re-implement file sources with data source V2 API
>>>>>>
>>>>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>>>>>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>>>>>> Dynamic Allocation
>>>>>>
>>>>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>>>>>> Partially push down disjunctive predicated in Parquet/ORC
>>>>>>
>>>>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>>>>>> test cases from PostgreSQL to Spark SQL (ongoing)
>>>>>>
>>>>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>>>>>> Deprecate Python 2 support
>>>>>>
>>>>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921>
>>>>>> Convert applicable *.sql tests into UDF integrated test base
>>>>>>
>>>>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963>
>>>>>> Allow dynamic allocation without an external shuffle service
>>>>>>
>>>>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177>
>>>>>> Adjust post shuffle partition number in adaptive execution
>>>>>>
>>>>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372>
>>>>>> Document Spark WEB UI
>>>>>>
>>>>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>>>>>> RobustScaler feature transformer
>>>>>>
>>>>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426>
>>>>>> Metadata Handling in Thrift Server
>>>>>>
>>>>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588>
>>>>>> Build a SQL reference doc (ongoing)
>>>>>>
>>>>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608>
>>>>>> Improve test coverage of ThriftServer
>>>>>>
>>>>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>>>>>> Dynamically reuse subqueries in AQE
>>>>>>
>>>>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855>
>>>>>> Remove outdated Experimental, Evolving annotations
>>>>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>>>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980>
>>>>>> Remove deprecated items since <= 2.2.0
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Xingbo
>>>>>>
>>>>>
