Re: Spark 3.0 preview release feature list and major changes

2019-10-19 Thread Erik Erlandson
I'd like to get SPARK-27296 onto 3.0:
SPARK-27296  Efficient User Defined Aggregators



On Mon, Oct 7, 2019 at 3:03 PM Xingbo Jiang  wrote:

> Hi all,
>
> I went over all the finished JIRA tickets targeted to Spark 3.0.0, here
> I'm listing all the notable features and major changes that are ready to
> test/deliver, please don't hesitate to add more to the list:
>
> SPARK-11215  Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150  Implement
> Dynamic Partition Pruning
>
> SPARK-13677  Support
> Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712  Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827  R API for
> Power Iteration Clustering
>
> SPARK-20286  Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636  Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148  Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796  Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23674  Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710  Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build and
> Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341  Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348  Data
> source for binary files
>
> SPARK-25603 
> Generalize Nested Column Pruning
>
> SPARK-26132  Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215  define
> reserved keywords after SQL standard
>
> SPARK-26412  Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26785  data
> source v2 API refactor: streaming write
>
> SPARK-26956  remove
> streaming output mode from data source v2 APIs
>
> SPARK-27064  create
> StreamingWrite at the beginning of streaming execution
>
> SPARK-27119  Do not
> infer schema when reading Hive serde table with native data source
>
> SPARK-27225  Implement
> join strategy hints
>
> SPARK-27240  Use
> pandas DataFrame for struct type argument in Scalar Pandas UDF
>
> SPARK-27338  Fix
> deadlock between TaskMemoryManager and
> UnsafeExternalSorter$SpillableIterator
>
> SPARK-27396  Public
> APIs for extended Columnar Processing Support
>
> SPARK-27589 
> Re-implement file sources with data source V2 API
>
> SPARK-27677 
> Disk-persisted RDD blocks served by shuffle service, and ignored for
> Dynamic Allocation
>
> SPARK-27699  Partially
> push down disjunctive 

Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Weichen Xu
Wait... I have a few additions:

*New API:*
SPARK-25097 Support prediction on single instance in KMeans/BiKMeans/GMM
SPARK-28045 add missing RankingEvaluator
SPARK-29121 Support Dot Product for Vectors
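
For illustration, the dot-product semantics that SPARK-29121 exposes on ML
Vectors can be sketched in plain Python. This is a hedged sketch of the
dense/sparse semantics only, not Spark's actual implementation, and the
function names are invented for the example:

```python
def dense_dot(a, b):
    """Dot product of two dense vectors given as equal-length float lists."""
    assert len(a) == len(b), "vectors must have the same length"
    return sum(x * y for x, y in zip(a, b))

def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as (indices, values) pairs;
    only indices present in both vectors contribute."""
    lookup = dict(zip(indices_b, values_b))
    return sum(v * lookup.get(i, 0.0) for i, v in zip(indices_a, values_a))

# 1*4 + 2*5 + 3*6 = 32
print(dense_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
# only index 2 overlaps: 3 * 6 = 18
print(sparse_dot([0, 2], [1.0, 3.0], [2, 5], [6.0, 2.0]))  # 18.0
```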

*Behavior change or new API with behavior change:*
SPARK-23265 Update multi-column error handling logic in QuantileDiscretizer
SPARK-22798 Add multiple column support to PySpark StringIndexer
SPARK-11215 Add multiple columns support to StringIndexer
SPARK-24102 RegressionEvaluator should use sample weight data
SPARK-24101 MulticlassClassificationEvaluator should use sample weight data
SPARK-24103 BinaryClassificationEvaluator should use sample weight data
SPARK-23469 HashingTF should use corrected MurmurHash3 implementation
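
The HashingTF fix in SPARK-23469 concerns the hash function behind the
"hashing trick": each term is hashed to a column index and term counts are
accumulated there. A minimal stand-alone sketch of that trick, using md5 as
a deterministic stand-in for Spark's MurmurHash3 (so the bucket assignments
here do NOT match HashingTF's):

```python
import hashlib

def hashing_tf(terms, num_features=16):
    """Map terms to a fixed-size term-frequency vector via the hashing trick.

    Spark's HashingTF uses MurmurHash3 (the variant corrected by
    SPARK-23469); md5 is only an illustrative deterministic hash here.
    """
    vec = [0] * num_features
    for term in terms:
        # take 4 bytes of the digest as an unsigned int, then bucket it
        h = int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:4], "little")
        vec[h % num_features] += 1
    return vec

v = hashing_tf(["spark", "spark", "hash"])
assert sum(v) == 3   # every term lands in some bucket
assert max(v) >= 2   # "spark" appears twice and always hashes to one bucket
```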

*Deprecated API removal:*
SPARK-25382 Remove ImageSchema.readImages in 3.0
SPARK-26133 Remove deprecated OneHotEncoder and rename
OneHotEncoderEstimator to OneHotEncoder
SPARK-25867 Remove KMeans computeCost
SPARK-28243 remove setFeatureSubsetStrategy and setSubsamplingRate from
Python TreeEnsembleParams
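
Regarding SPARK-25867: once computeCost is removed, the within-set sum of
squared errors it returned can still be computed directly from its
definition (or via ClusteringEvaluator). A pure-Python sketch of the WSSSE
formula, independent of Spark:

```python
def wssse(points, centers, assignment):
    """Within-set sum of squared errors: for each point, the squared
    Euclidean distance to its assigned cluster center, summed over all
    points. This is the quantity KMeans.computeCost used to report."""
    total = 0.0
    for p, c_idx in zip(points, assignment):
        c = centers[c_idx]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centers = [(0.5, 0.0), (10.0, 10.0)]
assignment = [0, 0, 1]
print(wssse(points, centers, assignment))  # 0.25 + 0.25 + 0.0 = 0.5
```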

Thanks!

Weichen

On Fri, Oct 11, 2019 at 6:11 AM Xingbo Jiang  wrote:

> Hi all,
>
> Here is the updated feature list:
>
>
> SPARK-11215  Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150  Implement
> Dynamic Partition Pruning
>
> SPARK-13677  Support
> Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712  Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827  R API for
> Power Iteration Clustering
>
> SPARK-20286  Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636  Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148  Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796  Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23155  Apply
> custom log URL pattern for executor log URLs in SHS
>
> SPARK-23539  Add
> support for Kafka headers
>
> SPARK-23674  Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710  Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build and
> Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341  Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348  Data
> source for binary files
>
> SPARK-25390  data
> source V2 API refactoring
>
> SPARK-25501  Add Kafka
> delegation token support
>
> SPARK-25603 
> Generalize Nested Column Pruning
>
> SPARK-26132  Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215  define
> reserved keywords after SQL standard
>
> SPARK-26412  Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26651  Use
> Proleptic Gregorian calendar
>
> SPARK-26759  Arrow
> optimization in SparkR's interoperability
>
> SPARK-26848 

Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Xingbo Jiang
Hi all,

Here is the updated feature list:


SPARK-11215  Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150  Implement
Dynamic Partition Pruning

SPARK-13677  Support
Tree-Based Feature Transformation

SPARK-16692  Add
MultilabelClassificationEvaluator

SPARK-19591  Add sample
weights to decision trees

SPARK-19712  Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827  R API for
Power Iteration Clustering

SPARK-20286  Improve
logic for timing out executors in dynamic allocation

SPARK-20636  Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148  Acquire new
executors to avoid hang because of blacklisting

SPARK-22796  Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128  A new
approach to do adaptive execution in Spark SQL

SPARK-23155  Apply
custom log URL pattern for executor log URLs in SHS

SPARK-23539  Add support
for Kafka headers

SPARK-23674  Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710  Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333  Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417  Build and
Run Spark on JDK11

SPARK-24615 
Accelerator-aware task scheduling for Spark

SPARK-24920  Allow
sharing Netty's memory pool allocators

SPARK-25250  Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341  Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348  Data source
for binary files

SPARK-25390  data source
V2 API refactoring

SPARK-25501  Add Kafka
delegation token support

SPARK-25603  Generalize
Nested Column Pruning

SPARK-26132  Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215  define
reserved keywords after SQL standard

SPARK-26412  Allow
Pandas UDF to take an iterator of pd.DataFrames

SPARK-26651  Use
Proleptic Gregorian calendar

SPARK-26759  Arrow
optimization in SparkR's interoperability

SPARK-26848  Introduce
new option to Kafka source: offset by timestamp (starting/ending)

SPARK-27064  create
StreamingWrite at the beginning of streaming execution

SPARK-27119  Do not
infer schema when reading Hive serde table with native data source

SPARK-27225  Implement
join strategy hints

SPARK-27240  Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338  Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396  Public APIs
for extended Columnar Processing Support

SPARK-27463  Support
Dataframe Cogroup via Pandas UDFs

SPARK-27589 
Re-implement file sources with data source V2 API

SPARK-27677 
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699 
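
One item above, SPARK-26412 (Pandas UDF taking an iterator of
pd.DataFrames), changes the UDF calling pattern. Without pyspark or pandas
installed, the iterator-of-batches shape can be sketched with plain Python
lists standing in for pd.DataFrame batches (a hedged illustration of the
pattern, not the pyspark API):

```python
def iter_udf(batches):
    """An iterator-of-batches UDF: consumes an iterator of input batches and
    yields output batches. In Spark this lets expensive state (e.g. a loaded
    model) be initialized once per partition rather than once per batch."""
    expensive_state = {"scale": 2}  # imagine loading a model here, once
    for batch in batches:
        yield [x * expensive_state["scale"] for x in batch]

batches = [[1, 2], [3, 4, 5]]
print(list(iter_udf(iter(batches))))  # [[2, 4], [6, 8, 10]]
```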

Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Sean Owen
See the JIRA; this is too open-ended, and it's not obviously anything more
than a consequence of choices in data representation, what you're trying to
do, etc. It's correctly closed, IMHO.
However, identifying the issue more narrowly, as something that looks ripe
for optimization, would be useful.

On Thu, Oct 10, 2019 at 12:30 PM antonkulaga  wrote:
>
> I think for sure  SPARK-28547
> 
> At the moment there are some flaws in Spark's architecture: it performs
> miserably, or even freezes, whenever the column count exceeds 10-15K (even
> a simple describe() call takes ages, while the same operation in pandas
> without Spark takes seconds). In many fields (like bioinformatics), wide
> datasets with large numbers of both rows and columns are very common (gene
> expression data is a good example), and Spark is effectively unusable there.
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread antonkulaga
I think for sure  SPARK-28547
  
At the moment there are some flaws in Spark's architecture: it performs
miserably, or even freezes, whenever the column count exceeds 10-15K (even
a simple describe() call takes ages, while the same operation in pandas
without Spark takes seconds). In many fields (like bioinformatics), wide
datasets with large numbers of both rows and columns are very common (gene
expression data is a good example), and Spark is effectively unusable there.
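
For context, the workload at issue (per-column summary statistics over a
very wide table) is cheap in a single-machine columnar pass, which is the
baseline being compared against. A toy sketch of what describe computes per
column (illustrative only; not pandas' or Spark's implementation):

```python
import math

def describe(columns):
    """Per-column count/mean/stddev/min/max, one pass per column.

    `columns` maps column name -> list of floats; with tens of thousands of
    columns this loop is exactly the work that should stay cheap.
    """
    stats = {}
    for name, values in columns.items():
        n = len(values)
        mean = sum(values) / n
        # sample variance, matching the usual describe() convention
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
        stats[name] = {"count": n, "mean": mean, "stddev": math.sqrt(var),
                       "min": min(values), "max": max(values)}
    return stats

# a "wide" toy table: 10,000 columns x 10 rows
data = {f"g{i}": [float(r) for r in range(10)] for i in range(10_000)}
s = describe(data)
print(s["g0"]["mean"])  # 4.5
```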






Re: Spark 3.0 preview release feature list and major changes

2019-10-09 Thread Xiao Li
SPARK-29345  Add an API
that allows a user to define and observe arbitrary metrics on streaming
queries

Let us add this too.

Cheers,

Xiao

On Tue, Oct 8, 2019 at 10:31 PM Wenchen Fan  wrote:

> Regarding DS v2, I'd like to remove
> SPARK-26785  data
> source v2 API refactor: streaming write
> SPARK-26956  remove
> streaming output mode from data source v2 APIs
>
> and put the umbrella ticket instead
> SPARK-25390  data
> source V2 API refactoring
>
> Thanks,
> Wenchen
>
> On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun 
> wrote:
>
>> Thank you for the preparation of 3.0-preview, Xingbo!
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang 
>> wrote:
>>
>>>  What's the process to propose a feature to be included in the final
 Spark 3.0 release?

>>>
>>> I don't know whether there exists any specific process here, normally
>>> you just merge the feature into Spark master before release code freeze,
>>> and then the feature would probably be included in the release. The code
>>> freeze date for Spark 3.0 has not been decided yet, though.
>>>
>>> On Tue, Oct 8, 2019 at 2:14 PM, Li Jin wrote:
>>>
 Thanks for summary!

 I have a question that is semi-related - What's the process to propose
 a feature to be included in the final Spark 3.0 release?

 In particular, I am interested in
 https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do
 the work so want to make sure I don't miss the "cut" date.

 On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang 
 wrote:

> Hi all,
>
> Thanks for all the feedbacks, here is the updated feature list:
>
> SPARK-11215 
> Multiple columns support added to various Transformers: StringIndexer
>
> SPARK-11150 
> Implement Dynamic Partition Pruning
>
> SPARK-13677 
> Support Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712 
> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window,
> Union etc.
>
> SPARK-19827  R API
> for Power Iteration Clustering
>
> SPARK-20286 
> Improve logic for timing out executors in dynamic allocation
>
> SPARK-20636 
> Eliminate unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148 
> Acquire new executors to avoid hang because of blacklisting
>
> SPARK-22796 
> Multiple columns support added to various Transformers: PySpark
> QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23155  Apply
> custom log URL pattern for executor log URLs in SHS
>
> SPARK-23539  Add
> support for Kafka headers
>
> SPARK-23674  Add
> Spark ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710 
> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add
> fit with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build
> and Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix
> race condition with tasks running when new attempt for same stage is
> created leads to other task in the next attempt running on the same
> partition id retry multiple times
>
> SPARK-25341 
> Support rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348 

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Wenchen Fan
Regarding DS v2, I'd like to remove
SPARK-26785  data source
v2 API refactor: streaming write
SPARK-26956  remove
streaming output mode from data source v2 APIs

and put the umbrella ticket instead
SPARK-25390  data source
V2 API refactoring

Thanks,
Wenchen

On Wed, Oct 9, 2019 at 1:19 PM Dongjoon Hyun 
wrote:

> Thank you for the preparation of 3.0-preview, Xingbo!
>
> Bests,
> Dongjoon.
>
> On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang  wrote:
>
>>  What's the process to propose a feature to be included in the final
>>> Spark 3.0 release?
>>>
>>
>> I don't know whether there exists any specific process here, normally you
>> just merge the feature into Spark master before release code freeze, and
>> then the feature would probably be included in the release. The code freeze
>> date for Spark 3.0 has not been decided yet, though.
>>
>> On Tue, Oct 8, 2019 at 2:14 PM, Li Jin wrote:
>>
>>> Thanks for summary!
>>>
>>> I have a question that is semi-related - What's the process to propose a
>>> feature to be included in the final Spark 3.0 release?
>>>
>>> In particular, I am interested in
>>> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do
>>> the work so want to make sure I don't miss the "cut" date.
>>>
>>> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang 
>>> wrote:
>>>
 Hi all,

 Thanks for all the feedbacks, here is the updated feature list:

 SPARK-11215 
 Multiple columns support added to various Transformers: StringIndexer

 SPARK-11150 
 Implement Dynamic Partition Pruning

 SPARK-13677 
 Support Tree-Based Feature Transformation

 SPARK-16692  Add
 MultilabelClassificationEvaluator

 SPARK-19591  Add
 sample weights to decision trees

 SPARK-19712 
 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window,
 Union etc.

 SPARK-19827  R API
 for Power Iteration Clustering

 SPARK-20286 
 Improve logic for timing out executors in dynamic allocation

 SPARK-20636 
 Eliminate unnecessary shuffle with adjacent Window expressions

 SPARK-22148 
 Acquire new executors to avoid hang because of blacklisting

 SPARK-22796 
 Multiple columns support added to various Transformers: PySpark
 QuantileDiscretizer

 SPARK-23128  A new
 approach to do adaptive execution in Spark SQL

 SPARK-23155  Apply
 custom log URL pattern for executor log URLs in SHS

 SPARK-23539  Add
 support for Kafka headers

 SPARK-23674  Add
 Spark ML Listener for Tracking ML Pipeline Status

 SPARK-23710 
 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2

 SPARK-24333  Add
 fit with validation set to Gradient Boosted Trees: Python API

 SPARK-24417  Build
 and Run Spark on JDK11

 SPARK-24615 
 Accelerator-aware task scheduling for Spark

 SPARK-24920  Allow
 sharing Netty's memory pool allocators

 SPARK-25250  Fix
 race condition with tasks running when new attempt for same stage is
 created leads to other task in the next attempt running on the same
 partition id retry multiple times

 SPARK-25341 
 Support rolling back a shuffle map stage and re-generate the shuffle files

 SPARK-25348  Data
 source for binary files

 SPARK-25501  Add
 kafka delegation token support

 SPARK-25603 
 Generalize Nested Column Pruning

 SPARK-26132  Remove

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Dongjoon Hyun
Thank you for the preparation of 3.0-preview, Xingbo!

Bests,
Dongjoon.

On Tue, Oct 8, 2019 at 2:32 PM Xingbo Jiang  wrote:

>  What's the process to propose a feature to be included in the final Spark
>> 3.0 release?
>>
>
> I don't know whether there exists any specific process here, normally you
> just merge the feature into Spark master before release code freeze, and
> then the feature would probably be included in the release. The code freeze
> date for Spark 3.0 has not been decided yet, though.
>
> On Tue, Oct 8, 2019 at 2:14 PM, Li Jin wrote:
>
>> Thanks for summary!
>>
>> I have a question that is semi-related - What's the process to propose a
>> feature to be included in the final Spark 3.0 release?
>>
>> In particular, I am interested in
>> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do the
>> work so want to make sure I don't miss the "cut" date.
>>
>> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang 
>> wrote:
>>
>>> Hi all,
>>>
>>> Thanks for all the feedbacks, here is the updated feature list:
>>>
>>> SPARK-11215 
>>> Multiple columns support added to various Transformers: StringIndexer
>>>
>>> SPARK-11150 
>>> Implement Dynamic Partition Pruning
>>>
>>> SPARK-13677  Support
>>> Tree-Based Feature Transformation
>>>
>>> SPARK-16692  Add
>>> MultilabelClassificationEvaluator
>>>
>>> SPARK-19591  Add
>>> sample weights to decision trees
>>>
>>> SPARK-19712  Pushing
>>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>>
>>> SPARK-19827  R API
>>> for Power Iteration Clustering
>>>
>>> SPARK-20286  Improve
>>> logic for timing out executors in dynamic allocation
>>>
>>> SPARK-20636 
>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>
>>> SPARK-22148  Acquire
>>> new executors to avoid hang because of blacklisting
>>>
>>> SPARK-22796 
>>> Multiple columns support added to various Transformers: PySpark
>>> QuantileDiscretizer
>>>
>>> SPARK-23128  A new
>>> approach to do adaptive execution in Spark SQL
>>>
>>> SPARK-23155  Apply
>>> custom log URL pattern for executor log URLs in SHS
>>>
>>> SPARK-23539  Add
>>> support for Kafka headers
>>>
>>> SPARK-23674  Add
>>> Spark ML Listener for Tracking ML Pipeline Status
>>>
>>> SPARK-23710  Upgrade
>>> the built-in Hive to 2.3.5 for hadoop-3.2
>>>
>>> SPARK-24333  Add fit
>>> with validation set to Gradient Boosted Trees: Python API
>>>
>>> SPARK-24417  Build
>>> and Run Spark on JDK11
>>>
>>> SPARK-24615 
>>> Accelerator-aware task scheduling for Spark
>>>
>>> SPARK-24920  Allow
>>> sharing Netty's memory pool allocators
>>>
>>> SPARK-25250  Fix
>>> race condition with tasks running when new attempt for same stage is
>>> created leads to other task in the next attempt running on the same
>>> partition id retry multiple times
>>>
>>> SPARK-25341  Support
>>> rolling back a shuffle map stage and re-generate the shuffle files
>>>
>>> SPARK-25348  Data
>>> source for binary files
>>>
>>> SPARK-25501  Add
>>> kafka delegation token support
>>>
>>> SPARK-25603 
>>> Generalize Nested Column Pruning
>>>
>>> SPARK-26132  Remove
>>> support for Scala 2.11 in Spark 3.0.0
>>>
>>> SPARK-26215  define
>>> reserved keywords after SQL standard
>>>
>>> SPARK-26412  Allow
>>> Pandas UDF to take an iterator of pd.DataFrames
>>>
>>> SPARK-26759  Arrow
>>> optimization in SparkR's interoperability
>>>
>>> SPARK-26785  data
>>> source v2 API refactor: streaming write
>>>
>>> SPARK-26848 

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Xingbo Jiang
>
>  What's the process to propose a feature to be included in the final Spark
> 3.0 release?
>

I don't know of any specific process here; normally you just merge the
feature into Spark master before the release code freeze, and then the
feature will likely be included in the release. The code freeze date for
Spark 3.0 has not been decided yet, though.

On Tue, Oct 8, 2019 at 2:14 PM, Li Jin wrote:

> Thanks for summary!
>
> I have a question that is semi-related - What's the process to propose a
> feature to be included in the final Spark 3.0 release?
>
> In particular, I am interested in
> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do the
> work so want to make sure I don't miss the "cut" date.
>
> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang  wrote:
>
>> Hi all,
>>
>> Thanks for all the feedbacks, here is the updated feature list:
>>
>> SPARK-11215  Multiple
>> columns support added to various Transformers: StringIndexer
>>
>> SPARK-11150 
>> Implement Dynamic Partition Pruning
>>
>> SPARK-13677  Support
>> Tree-Based Feature Transformation
>>
>> SPARK-16692  Add
>> MultilabelClassificationEvaluator
>>
>> SPARK-19591  Add
>> sample weights to decision trees
>>
>> SPARK-19712  Pushing
>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>
>> SPARK-19827  R API
>> for Power Iteration Clustering
>>
>> SPARK-20286  Improve
>> logic for timing out executors in dynamic allocation
>>
>> SPARK-20636 
>> Eliminate unnecessary shuffle with adjacent Window expressions
>>
>> SPARK-22148  Acquire
>> new executors to avoid hang because of blacklisting
>>
>> SPARK-22796  Multiple
>> columns support added to various Transformers: PySpark QuantileDiscretizer
>>
>> SPARK-23128  A new
>> approach to do adaptive execution in Spark SQL
>>
>> SPARK-23155  Apply
>> custom log URL pattern for executor log URLs in SHS
>>
>> SPARK-23539  Add
>> support for Kafka headers
>>
>> SPARK-23674  Add
>> Spark ML Listener for Tracking ML Pipeline Status
>>
>> SPARK-23710  Upgrade
>> the built-in Hive to 2.3.5 for hadoop-3.2
>>
>> SPARK-24333  Add fit
>> with validation set to Gradient Boosted Trees: Python API
>>
>> SPARK-24417  Build
>> and Run Spark on JDK11
>>
>> SPARK-24615 
>> Accelerator-aware task scheduling for Spark
>>
>> SPARK-24920  Allow
>> sharing Netty's memory pool allocators
>>
>> SPARK-25250  Fix race
>> condition with tasks running when new attempt for same stage is created
>> leads to other task in the next attempt running on the same partition id
>> retry multiple times
>>
>> SPARK-25341  Support
>> rolling back a shuffle map stage and re-generate the shuffle files
>>
>> SPARK-25348  Data
>> source for binary files
>>
>> SPARK-25501  Add
>> kafka delegation token support
>>
>> SPARK-25603 
>> Generalize Nested Column Pruning
>>
>> SPARK-26132  Remove
>> support for Scala 2.11 in Spark 3.0.0
>>
>> SPARK-26215  define
>> reserved keywords after SQL standard
>>
>> SPARK-26412  Allow
>> Pandas UDF to take an iterator of pd.DataFrames
>>
>> SPARK-26759  Arrow
>> optimization in SparkR's interoperability
>>
>> SPARK-26785  data
>> source v2 API refactor: streaming write
>>
>> SPARK-26848 
>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>
>> SPARK-26956  remove
>> streaming output mode from data source v2 APIs
>>
>> SPARK-27064 

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Li Jin
Thanks for the summary!

I have a question that is semi-related: what's the process to propose a
feature to be included in the final Spark 3.0 release?

In particular, I am interested in
https://issues.apache.org/jira/browse/SPARK-28006. I am happy to do the
work, so I want to make sure I don't miss the "cut" date.

On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang  wrote:

> Hi all,
>
> Thanks for all the feedbacks, here is the updated feature list:
>
> SPARK-11215  Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150  Implement
> Dynamic Partition Pruning
>
> SPARK-13677  Support
> Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712  Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827  R API for
> Power Iteration Clustering
>
> SPARK-20286  Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636  Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148  Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796  Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23155  Apply
> custom log URL pattern for executor log URLs in SHS
>
> SPARK-23539  Add
> support for Kafka headers
>
> SPARK-23674  Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710  Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build and
> Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341  Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348  Data
> source for binary files
>
> SPARK-25501  Add kafka
> delegation token support
>
> SPARK-25603 
> Generalize Nested Column Pruning
>
> SPARK-26132  Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215  define
> reserved keywords after SQL standard
>
> SPARK-26412  Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26759  Arrow
> optimization in SparkR's interoperability
>
> SPARK-26785  data
> source v2 API refactor: streaming write
>
> SPARK-26848  Introduce
> new option to Kafka source: offset by timestamp (starting/ending)
>
> SPARK-26956  remove
> streaming output mode from data source v2 APIs
>
> SPARK-27064  create
> StreamingWrite at the beginning of streaming execution
>
> SPARK-27119  Do not
> infer schema when reading Hive serde table with native data source
>
> SPARK-27225  Implement
> join strategy hints
>
> SPARK-27240  Use
> pandas DataFrame for struct type argument in Scalar Pandas UDF
>
> SPARK-27338 

Re: Spark 3.0 preview release feature list and major changes

2019-10-08 Thread Xingbo Jiang
Hi all,

Thanks for all the feedback; here is the updated feature list:

SPARK-11215  Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150  Implement
Dynamic Partition Pruning

SPARK-13677  Support
Tree-Based Feature Transformation

SPARK-16692  Add
MultilabelClassificationEvaluator

SPARK-19591  Add sample
weights to decision trees

SPARK-19712  Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827  R API for
Power Iteration Clustering

SPARK-20286  Improve
logic for timing out executors in dynamic allocation

SPARK-20636  Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148  Acquire new
executors to avoid hang because of blacklisting

SPARK-22796  Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128  A new
approach to do adaptive execution in Spark SQL

SPARK-23155  Apply
custom log URL pattern for executor log URLs in SHS

SPARK-23539  Add support
for Kafka headers

SPARK-23674  Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710  Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333  Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417  Build and
Run Spark on JDK11

SPARK-24615 
Accelerator-aware task scheduling for Spark

SPARK-24920  Allow
sharing Netty's memory pool allocators

SPARK-25250  Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341  Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348  Data source
for binary files

SPARK-25501  Add kafka
delegation token support

SPARK-25603  Generalize
Nested Column Pruning

SPARK-26132  Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215  define
reserved keywords after SQL standard

SPARK-26412  Allow
Pandas UDF to take an iterator of pd.DataFrames

SPARK-26759  Arrow
optimization in SparkR's interoperability

SPARK-26785  data source
v2 API refactor: streaming write

SPARK-26848  Introduce
new option to Kafka source: offset by timestamp (starting/ending)

SPARK-26956  remove
streaming output mode from data source v2 APIs

SPARK-27064  create
StreamingWrite at the beginning of streaming execution

SPARK-27119  Do not
infer schema when reading Hive serde table with native data source

SPARK-27225  Implement
join strategy hints

SPARK-27240  Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338  Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396  Public APIs
for extended Columnar Processing Support

SPARK-27463  Support
Dataframe Cogroup via Pandas UDFs

SPARK-27589 
Re-implement file sources with data source V2 API

SPARK-27677 
Disk-persisted RDD blocks served by shuffle 

Re: Spark 3.0 preview release feature list and major changes

2019-10-07 Thread Hyukjin Kwon
Cogroup Pandas UDF missing:

SPARK-27463  Support
Dataframe Cogroup via Pandas UDFs

Vectorized R execution:

SPARK-26759  Arrow
optimization in SparkR's interoperability
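
The cogroup semantics behind SPARK-27463 can be sketched without Spark:
group both inputs by key, then hand each pair of groups to a user function
(in the real API, as two pd.DataFrames passed to a Pandas UDF). A
plain-Python sketch of just the grouping-and-pairing step:

```python
from collections import defaultdict

def cogroup(left, right, key=lambda r: r[0]):
    """Group two record lists by key and yield (key, left_group,
    right_group) for every key present in either input. This mirrors the
    semantics behind Dataframe cogroup; it is not pyspark's API."""
    groups_l, groups_r = defaultdict(list), defaultdict(list)
    for r in left:
        groups_l[key(r)].append(r)
    for r in right:
        groups_r[key(r)].append(r)
    for k in sorted(set(groups_l) | set(groups_r)):
        yield k, groups_l[k], groups_r[k]

left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", 10), ("c", 30)]
for k, lg, rg in cogroup(left, right):
    print(k, lg, rg)
# a [('a', 1), ('a', 3)] [('a', 10)]
# b [('b', 2)] []
# c [] [('c', 30)]
```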


On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim wrote:

Re: Spark 3.0 preview release feature list and major changes

2019-10-07 Thread Jungtaek Lim
Thanks for bringing the nice summary of Spark 3.0 improvements!

I'd like to add some items from structured streaming side,

SPARK-28199  Move
Trigger implementations to Triggers.scala and avoid exposing these to the
end users (removal of deprecated)
SPARK-23539  Add support
for Kafka headers in Structured Streaming
SPARK-25501  Add kafka
delegation token support (there were follow-up issues to add
functionalities like support multi clusters, etc.)
SPARK-26848  Introduce
new option to Kafka source: offset by timestamp (starting/ending)
SPARK-28074  Log warn
message on possible correctness issue for multiple stateful operations in
single query

and core side,

SPARK-23155  New
feature: apply custom log URL pattern for executor log URLs in SHS
(follow-up issue expanded the functionality to Spark UI as well)

FYI, if we count current work in progress, there's also an ongoing umbrella
issue for rolling event logs & snapshots (SPARK-28594
), which we're still working to
get done in Spark 3.0.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang  wrote:


Spark 3.0 preview release feature list and major changes

2019-10-07 Thread Xingbo Jiang
Hi all,

I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here I'm
listing all the notable features and major changes that are ready to
test/deliver; please don't hesitate to add more to the list:

SPARK-11215  Multiple
columns support added to various Transformers: StringIndexer

SPARK-11150  Implement
Dynamic Partition Pruning

SPARK-13677  Support
Tree-Based Feature Transformation

SPARK-16692  Add
MultilabelClassificationEvaluator

SPARK-19591  Add sample
weights to decision trees

SPARK-19712  Pushing
Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.

SPARK-19827  R API for
Power Iteration Clustering

SPARK-20286  Improve
logic for timing out executors in dynamic allocation

SPARK-20636  Eliminate
unnecessary shuffle with adjacent Window expressions

SPARK-22148  Acquire new
executors to avoid hang because of blacklisting

SPARK-22796  Multiple
columns support added to various Transformers: PySpark QuantileDiscretizer

SPARK-23128  A new
approach to do adaptive execution in Spark SQL

SPARK-23674  Add Spark
ML Listener for Tracking ML Pipeline Status

SPARK-23710  Upgrade the
built-in Hive to 2.3.5 for hadoop-3.2

SPARK-24333  Add fit
with validation set to Gradient Boosted Trees: Python API

SPARK-24417  Build and
Run Spark on JDK11

SPARK-24615 
Accelerator-aware task scheduling for Spark

SPARK-24920  Allow
sharing Netty's memory pool allocators

SPARK-25250  Fix race
condition with tasks running when new attempt for same stage is created
leads to other task in the next attempt running on the same partition id
retry multiple times

SPARK-25341  Support
rolling back a shuffle map stage and re-generate the shuffle files

SPARK-25348  Data source
for binary files

SPARK-25603  Generalize
Nested Column Pruning

SPARK-26132  Remove
support for Scala 2.11 in Spark 3.0.0

SPARK-26215  define
reserved keywords after SQL standard

SPARK-26412  Allow
Pandas UDF to take an iterator of pd.DataFrames
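The iterator form above lets a UDF initialize expensive state (e.g. a model) once per partition and stream Arrow batches through it. A minimal sketch, assuming the Spark 3.0 mapInPandas API and a hypothetical `spark` session:

```python
from typing import Iterator

import pandas as pd

def add_double(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Expensive setup would go here, before the loop, and be reused
    # across every batch of the partition.
    for pdf in batches:
        yield pdf.assign(doubled=pdf["x"] * 2)

# Against a hypothetical `spark` session (Spark 3.0+):
# df = spark.createDataFrame([(1,), (2,)], ["x"])
# df.mapInPandas(add_double, schema="x long, doubled long").show()
```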

SPARK-26785  data source
v2 API refactor: streaming write

SPARK-26956  remove
streaming output mode from data source v2 APIs

SPARK-27064  create
StreamingWrite at the beginning of streaming execution

SPARK-27119  Do not
infer schema when reading Hive serde table with native data source

SPARK-27225  Implement
join strategy hints
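For context on the join hints above: Spark 3.0 accepts BROADCAST, MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL as /*+ ... */ comments in SQL (earlier releases only honored broadcast hints). A sketch with a purely illustrative helper; the table names are hypothetical:

```python
def with_join_hint(sql: str, hint: str) -> str:
    """Insert a /*+ ... */ hint comment after the first SELECT (illustrative only)."""
    return sql.replace("SELECT", f"SELECT /*+ {hint} */", 1)

query = with_join_hint(
    "SELECT * FROM facts f JOIN dims d ON f.k = d.k", "SHUFFLE_HASH(d)")

# DataFrame equivalent against a hypothetical session:
# facts.join(dims.hint("shuffle_hash"), "k")
```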

SPARK-27240  Use pandas
DataFrame for struct type argument in Scalar Pandas UDF

SPARK-27338  Fix
deadlock between TaskMemoryManager and
UnsafeExternalSorter$SpillableIterator

SPARK-27396  Public APIs
for extended Columnar Processing Support

SPARK-27589 
Re-implement file sources with data source V2 API

SPARK-27677 
Disk-persisted RDD blocks served by shuffle service, and ignored for
Dynamic Allocation

SPARK-27699  Partially
push down disjunctive predicated in Parquet/ORC

SPARK-27763  Port test
cases from PostgreSQL to Spark SQL (ongoing)

SPARK-27884  Deprecate
Python 2 support

SPARK-27921  Convert
applicable *.sql tests into UDF integrated test base

SPARK-27963