Re: Spark 3.0 preview release feature list and major changes

Xingbo Jiang Tue, 08 Oct 2019 13:53:23 -0700

Hi all,

Thanks for all the feedback. Here is the updated feature list:


SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement
Dynamic Partition Pruning

SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
Tree-Based Feature Transformation

SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
MultilabelClassificationEvaluator

SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample
weights to decision trees

SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for
Power Iteration Clustering

SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
logic for timing out executors in dynamic allocation

SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new
executors to avoid hang because of blacklisting

SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
approach to do adaptive execution in Spark SQL

SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply
custom log URL pattern for executor log URLs in SHS

SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support
for Kafka headers

SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and
Run Spark on JDK11

SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
Accelerator-aware task scheduling for Spark

SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
sharing Netty's memory pool allocators

SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source
for binary files

SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka
delegation token support

SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize
Nested Column Pruning

SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
reserved keywords after SQL standard

SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
Pandas UDF to take an iterator of pd.DataFrames

SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
optimization in SparkR's interoperability

SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data source
v2 API refactor: streaming write

SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce
new option to Kafka source: offset by timestamp (starting/ending)

SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
streaming output mode from data source v2 APIs

SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
StreamingWrite at the beginning of streaming execution

SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
infer schema when reading Hive serde table with native data source

SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement
join strategy hints

SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs
for extended Columnar Processing Support

SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
Dataframe Cogroup via Pandas UDFs

SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
Re-implement file sources with data source V2 API

SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially
push down disjunctive predicates in Parquet/ORC

SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test
cases from PostgreSQL to Spark SQL

SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate
Python 2 support

SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
applicable *.sql tests into UDF integrated test base

SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
dynamic allocation without an external shuffle service

SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post
shuffle partition number in adaptive execution

SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
Trigger implementations to Triggers.scala and avoid exposing these to the
end users

SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document
Spark WEB UI

SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
RobustScaler feature transformer

SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata
Handling in Thrift Server

SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL
reference doc

SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
test coverage of ThriftServer

SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically
reuse subqueries in AQE

SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
outdated Experimental, Evolving annotations

SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908> SPARK-28980
<https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items
since <= 2.2.0

Cheers,

Xingbo

On Mon, Oct 7, 2019 at 9:29 PM Hyukjin Kwon <[email protected]> wrote:

> The cogroup Pandas UDF is missing:
>
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
> Dataframe Cogroup via Pandas UDFs
>
> Vectorized R execution:
>
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
> optimization in SparkR's interoperability
>
>
> On Tue, Oct 8, 2019 at 7:50 AM Jungtaek Lim <[email protected]> wrote:
>
>> Thanks for bringing the nice summary of Spark 3.0 improvements!
>>
>> I'd like to add some items from the Structured Streaming side:
>>
>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>> Trigger implementations to Triggers.scala and avoid exposing these to the
>> end users (removal of deprecated)
>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>> support for Kafka headers in Structured Streaming
>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>> kafka delegation token support (there were follow-up issues to add
>> functionality such as multi-cluster support, etc.)
>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn
>> message on possible correctness issue for multiple stateful operations in
>> single query
>>
>> and from the core side:
>>
>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
>> feature: apply custom log URL pattern for executor log URLs in SHS
>> (follow-up issue expanded the functionality to Spark UI as well)
>>
>> FYI, if we also count current work in progress, there's an ongoing umbrella
>> issue for rolling event logs and snapshots (SPARK-28594
>> <https://issues.apache.org/jira/browse/SPARK-28594>), which we are
>> struggling to get done for Spark 3.0.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> I went over all the finished JIRA tickets targeted at Spark 3.0.0. Here
>>> I'm listing all the notable features and major changes that are ready to
>>> test/deliver; please don't hesitate to add more to the list:
>>>
>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>> Implement Dynamic Partition Pruning
>>>
>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
>>> Tree-Based Feature Transformation
>>>
>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>>> MultilabelClassificationEvaluator
>>>
>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>>> sample weights to decision trees
>>>
>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
>>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>
>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>>> for Power Iteration Clustering
>>>
>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
>>> logic for timing out executors in dynamic allocation
>>>
>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>
>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
>>> new executors to avoid hang because of blacklisting
>>>
>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796>
>>> Multiple columns support added to various Transformers: PySpark
>>> QuantileDiscretizer
>>>
>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>>> approach to do adaptive execution in Spark SQL
>>>
>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>>> Spark ML Listener for Tracking ML Pipeline Status
>>>
>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
>>> the built-in Hive to 2.3.5 for hadoop-3.2
>>>
>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
>>> with validation set to Gradient Boosted Trees: Python API
>>>
>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>>> and Run Spark on JDK11
>>>
>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>>> Accelerator-aware task scheduling for Spark
>>>
>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>>> sharing Netty's memory pool allocators
>>>
>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix
>>> race condition with tasks running when new attempt for same stage is
>>> created leads to other task in the next attempt running on the same
>>> partition id retry multiple times
>>>
>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
>>> rolling back a shuffle map stage and re-generate the shuffle files
>>>
>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>>> source for binary files
>>>
>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>>> Generalize Nested Column Pruning
>>>
>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
>>> support for Scala 2.11 in Spark 3.0.0
>>>
>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
>>> reserved keywords after SQL standard
>>>
>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
>>> Pandas UDF to take an iterator of pd.DataFrames
>>>
>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>>> source v2 API refactor: streaming write
>>>
>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
>>> streaming output mode from data source v2 APIs
>>>
>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
>>> StreamingWrite at the beginning of streaming execution
>>>
>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
>>> infer schema when reading Hive serde table with native data source
>>>
>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>>> Implement join strategy hints
>>>
>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>
>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>>> deadlock between TaskMemoryManager and
>>> UnsafeExternalSorter$SpillableIterator
>>>
>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public
>>> APIs for extended Columnar Processing Support
>>>
>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>>> Re-implement file sources with data source V2 API
>>>
>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>>> Dynamic Allocation
>>>
>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>>> Partially push down disjunctive predicates in Parquet/ORC
>>>
>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>>> test cases from PostgreSQL to Spark SQL (ongoing)
>>>
>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>>> Deprecate Python 2 support
>>>
>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
>>> applicable *.sql tests into UDF integrated test base
>>>
>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
>>> dynamic allocation without an external shuffle service
>>>
>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust
>>> post shuffle partition number in adaptive execution
>>>
>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372>
>>> Document Spark WEB UI
>>>
>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>>> RobustScaler feature transformer
>>>
>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426>
>>> Metadata Handling in Thrift Server
>>>
>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a
>>> SQL reference doc (ongoing)
>>>
>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
>>> test coverage of ThriftServer
>>>
>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>>> Dynamically reuse subqueries in AQE
>>>
>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
>>> outdated Experimental, Evolving annotations
>>>
>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove
>>> deprecated items since <= 2.2.0
>>>
>>> Cheers,
>>>
>>> Xingbo
>>>
>>
