Thanks for the summary! I have a semi-related question: what's the process for proposing a feature to be included in the final Spark 3.0 release?
In particular, I am interested in https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the work, so I want to make sure I don't miss the "cut" date.

On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Hi all,
>
> Thanks for all the feedback; here is the updated feature list:
>
> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support
> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define reserved keywords after SQL standard
> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source v2 API refactor: streaming write
> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove streaming output mode from data source v2 APIs
> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create StreamingWrite at the beginning of streaming execution
> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations (SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>)
> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0
>
> Cheers,
>
> Xingbo
>
> On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
>> Cogroup Pandas UDF missing:
>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
>>
>> Vectorized R execution:
>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
>>
>> On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>
>>> Thanks for bringing the nice summary of Spark 3.0 improvements!
>>>
>>> I'd like to add some items from the structured streaming side:
>>>
>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users (removal of deprecated)
>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers in Structured Streaming
>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka delegation token support (there were follow-up issues to add functionality such as multi-cluster support)
>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn message on possible correctness issue for multiple stateful operations in single query
>>>
>>> and from the core side:
>>>
>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New feature: apply custom log URL pattern for executor log URLs in SHS (a follow-up issue expanded the functionality to the Spark UI as well)
>>>
>>> FYI, counting current work in progress, there's an ongoing umbrella issue regarding rolling event log & snapshot (SPARK-28594 <https://issues.apache.org/jira/browse/SPARK-28594>) which we are struggling to get done in Spark 3.0.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here I'm listing all the notable features and major changes that are ready to test/deliver; please don't hesitate to add more to the list:
>>>>
>>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
>>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
>>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
>>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
>>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
>>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>> [...]
>>>>
>>>> Cheers,
>>>>
>>>> Xingbo
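
As an aside for readers skimming the list: SPARK-26412 (Pandas UDFs taking an iterator of batches) is easiest to grasp from a small sketch. The snippet below imitates the iterator-of-batches contract in plain pandas with no Spark dependency; the function name `plus_one` and the batch contents are made up for illustration, and in Spark 3.0 a function of this shape would be registered with the `pandas_udf` decorator.

```python
import pandas as pd
from typing import Iterator


def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Expensive one-time setup (e.g. loading a model) would go here;
    # with the iterator form it runs once per partition rather than
    # once per batch.
    state = 1
    for batch in batches:
        yield batch + state


# Simulate how Spark would feed the UDF a stream of Arrow-backed batches:
batches = iter([pd.Series([1, 2]), pd.Series([3])])
result = pd.concat(plus_one(batches), ignore_index=True)
print(result.tolist())  # [2, 3, 4]
```

The point of the iterator form is exactly that amortization of per-partition setup; the per-row semantics are unchanged from a scalar Pandas UDF.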