[jira] [Created] (SPARK-35662) Support Timestamp without time zone data type

2021-06-07 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-35662:
--

 Summary: Support Timestamp without time zone data type
 Key: SPARK-35662
 URL: https://issues.apache.org/jira/browse/SPARK-35662
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Apache Spark


Spark SQL today supports the TIMESTAMP data type. However, the semantics 
provided actually match TIMESTAMP WITH LOCAL TIME ZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
the session local time zone and are converted to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with 
calendars.
In many (more) other cases, such as when dealing with log files, it is desirable 
that the provided timestamps not be altered.
SQL users expect that they can model either behavior, and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMSs map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, and their 
users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that 
does not exist in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
the standard semantics.
Using these two types will provide clarity.
We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
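
For illustration only (this sketch is not part of the original proposal), the 
existing behavior described above can be seen in spark-shell: a TIMESTAMP value 
is an instant, and its rendering follows spark.sql.session.timeZone. The path 
and zone names below are arbitrary.
{code:scala}
// Write a timestamp that is parsed in one session time zone...
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
Seq("2021-06-07 00:00:00").toDF("s")
  .selectExpr("CAST(s AS TIMESTAMP) AS ts")          // interpreted as LA wall-clock time
  .write.mode("overwrite").parquet("/tmp/ts_demo")

// ...and read it back under another: the stored instant is unchanged, but the
// displayed wall-clock value shifts (2021-06-07 00:00 PDT -> 2021-06-07 16:00 KST).
spark.conf.set("spark.sql.session.timeZone", "Asia/Seoul")
spark.read.parquet("/tmp/ts_demo").show(false)
{code}
A TIMESTAMP WITHOUT TIME ZONE column would instead keep the written wall-clock 
reading no matter which session time zone is used to read it.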



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35662) Support Timestamp without time zone data type

2021-06-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35662:
---
Description: 
Spark SQL today supports the TIMESTAMP data type. However, the semantics 
provided actually match TIMESTAMP WITH LOCAL TIME ZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
the session local time zone and are converted to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with 
calendars.
In many (more) other cases, such as when dealing with log files, it is desirable 
that the provided timestamps not be altered.
SQL users expect that they can model either behavior, and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMSs map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, and their 
users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that 
does not exist in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
the standard semantics.
Using these two types will provide clarity.
We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.

h3. Milestone 1 – Spark Timestamp equivalency (the new Timestamp type 
TimestampNTZ meets or exceeds all functionality of the existing SQL Timestamp):

* Add a new DataType implementation for TimestampNTZ.
* Support TimestampNTZ in Dataset/UDF.
* TimestampNTZ literals
* TimestampNTZ arithmetic (e.g. TimestampNTZ - TimestampNTZ, TimestampNTZ - Date)
* Datetime functions/operators: dayofweek, weekofyear, year, etc.
* Cast to and from TimestampNTZ, cast String/Timestamp to TimestampNTZ, cast 
TimestampNTZ to string (pretty printing)/Timestamp, with the SQL syntax to 
specify the types
* Support sorting TimestampNTZ.

h3. Milestone 2 – Persistence:

* Ability to create tables of type TimestampNTZ
* Ability to write to common file formats such as Parquet and JSON.
* INSERT, SELECT, UPDATE, MERGE
* Discovery

h3. Milestone 3 – Client support

* JDBC support
* Hive Thrift server

h3. Milestone 4 – PySpark and Spark R integration

* Python UDF can take and return intervals
* DataFrame support

  was:
Spark SQL today supports the TIMESTAMP data type. However the semantics 
provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
session local timezone and cast to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with 
calendars.
In many (more) other cases, such as when dealing with log files it is desirable 
that the provided timestamps not be altered.
SQL users expect that they can model either behavior and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will be 
surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist 
in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
standard semantic.
Using these two types will provide clarity.
We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.


> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache.org/jira/browse/SPARK-35662
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Spark SQL today supports the TIMESTAMP data type. However the semantics 
> provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
> Timestamps embedded in a SQL query or passed through JDBC are presumed to be 
> in session local timezone and cast to UTC before being processed.
> These are desirable semantics in many cases, such as when dealing with 
> calendars.
> In many (more) other cases, such as when dealing with log files it is 
> desirable that the provided timestamps not be altered.
> SQL users expect that they can model either behavior and do so by using 
> TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
> LOCAL TIME ZONE for time zone sensitive data.
> Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will 
> be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a featu

[jira] [Updated] (SPARK-35662) Support Timestamp without time zone data type

2021-06-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35662:
---
Description: 
Spark SQL today supports the TIMESTAMP data type. However, the semantics 
provided actually match TIMESTAMP WITH LOCAL TIME ZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
the session local time zone and are converted to UTC before being processed.
 These are desirable semantics in many cases, such as when dealing with 
calendars.
 In many (more) other cases, such as when dealing with log files, it is 
desirable that the provided timestamps not be altered.
 SQL users expect that they can model either behavior, and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
 Most traditional RDBMSs map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, and their 
users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that 
does not exist in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
the standard semantics.
 Using these two types will provide clarity.
 We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
h3. Milestone 1 – Spark Timestamp equivalency (the new Timestamp type 
TimestampNTZ meets or exceeds all functionality of the existing SQL Timestamp):
 * Add a new DataType implementation for TimestampNTZ.
 * Support TimestampNTZ in Dataset/UDF.
 * TimestampNTZ literals
 * TimestampNTZ arithmetic (e.g. TimestampNTZ - TimestampNTZ, TimestampNTZ - Date)
 * Datetime functions/operators: dayofweek, weekofyear, year, etc.
 * Cast to and from TimestampNTZ, cast String/Timestamp to TimestampNTZ, cast 
TimestampNTZ to string (pretty printing)/Timestamp, with the SQL syntax to 
specify the types
 * Support sorting TimestampNTZ.

h3. Milestone 2 – Persistence:
 * Ability to create tables of type TimestampNTZ
 * Ability to write to common file formats such as Parquet and JSON.
 * INSERT, SELECT, UPDATE, MERGE
 * Discovery

h3. Milestone 3 – Client support
 * JDBC support
 * Hive Thrift server

h3. Milestone 4 – PySpark and Spark R integration
 * Python UDF can take and return TimestampNTZ
 * DataFrame support

  was:
Spark SQL today supports the TIMESTAMP data type. However the semantics 
provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
session local timezone and cast to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with 
calendars.
In many (more) other cases, such as when dealing with log files it is desirable 
that the provided timestamps not be altered.
SQL users expect that they can model either behavior and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will be 
surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist 
in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
standard semantic.
Using these two types will provide clarity.
We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.

h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type 
TimestampNTZ meets or exceeds all function of the existing SQL Timestamp):

* Add a new DataType implementation for TimestampNTZ.
* Support TimestampNTZ in Dataset/UDF.
* TimestampNTZ literals 
* TimestampNTZ arithmetic(e.g. TimestampNTZ - TimestampNTZ, TimestampNTZ - Date)
* Datetime functions/operators: dayofweek, weekofyear, year, etc
* Cast to and from TimestampNTZ, cast String/Timestamp to TimestampNTZ, cast 
TimestampNTZ to string (pretty printing)/Timestamp, with the SQL syntax to 
specify the types
* Support sorting TimestampNTZ.

h3. Milestone 2 – Persistence:

* Ability to create tables of type TimestampNTZ
* Ability to write to common file formats such as Parquet and JSON.
* INSERT, SELECT, UPDATE, MERGE
* Discovery

h3. Milestone 3 – Client support

* JDBC support
* Hive Thrift server

h3. Milestone 4 – PySpark and Spark R integration

* Python UDF can take and return intervals
* DataFrame support


> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache.org/jira/browse/SPARK-35662
> Project: Spark
>  Issue Type: New Feature
> 

[jira] [Created] (SPARK-35663) Add Timestamp without time zone type

2021-06-07 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-35663:
--

 Summary: Add Timestamp without time zone type
 Key: SPARK-35663
 URL: https://issues.apache.org/jira/browse/SPARK-35663
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Extend Catalyst's type system with a new type that conforms to the SQL standard 
(see SQL:2016, section 4.6.2):

* TimestampNTZType represents the timestamp without time zone type
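
As a plain JVM analogy (added for illustration, not part of the issue itself): 
java.time.Instant corresponds to the existing TIMESTAMP WITH LOCAL TIME ZONE 
semantics (a point on the time line), while java.time.LocalDateTime matches the 
SQL:2016 timestamp without time zone (a wall-clock reading with no zone 
attached):
{code:scala}
import java.time.{LocalDateTime, ZoneId}

val wallClock = LocalDateTime.parse("2021-06-07T00:00:00")  // no zone attached
// The same wall-clock reading maps to different instants in different zones:
val inSeoul = wallClock.atZone(ZoneId.of("Asia/Seoul")).toInstant
val inLA    = wallClock.atZone(ZoneId.of("America/Los_Angeles")).toInstant
println(inSeoul == inLA)  // false, which is why a dedicated type is needed
{code}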



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35646) Merge contents and remove obsolete pages in API reference section

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35646:


Assignee: Hyukjin Kwon

> Merge contents and remove obsolete pages in API reference section
> -
>
> Key: SPARK-35646
> URL: https://issues.apache.org/jira/browse/SPARK-35646
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> The Koalas documentation is now in the PySpark documentation. We should 
> probably now remove obsolete pages such as blog posts and talks. Also, we 
> should refine and merge the contents properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35646) Merge contents and remove obsolete pages in API reference section

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35646.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32799
[https://github.com/apache/spark/pull/32799]

> Merge contents and remove obsolete pages in API reference section
> -
>
> Key: SPARK-35646
> URL: https://issues.apache.org/jira/browse/SPARK-35646
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> The Koalas documentation is now in the PySpark documentation. We should 
> probably now remove obsolete pages such as blog posts and talks. Also, we 
> should refine and merge the contents properly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35664) Support java.time.LocalDateTime as an external type of TimestampNTZ type

2021-06-07 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-35664:
--

 Summary: Support java.time.LocalDateTime as an external type of 
TimestampNTZ type
 Key: SPARK-35664
 URL: https://issues.apache.org/jira/browse/SPARK-35664
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Allow parallelization/collection of java.time.LocalDateTime values, and convert 
the values to timestamp values of TimestampNTZType.
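
A hypothetical usage sketch (added for illustration; the final API and type 
rendering may differ) of what collecting java.time.LocalDateTime through a 
Dataset is expected to look like once this sub-task lands:
{code:scala}
import java.time.LocalDateTime
import spark.implicits._  // pre-imported in spark-shell

val ds = Seq(LocalDateTime.parse("2021-06-07T10:15:30")).toDS()
ds.printSchema()  // expected: a timestamp-without-time-zone column
ds.collect()      // expected: Array(2021-06-07T10:15:30), not shifted by the session zone
{code}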



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35663) Add Timestamp without time zone type

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35663:


Assignee: Gengliang Wang  (was: Apache Spark)

> Add Timestamp without time zone type
> 
>
> Key: SPARK-35663
> URL: https://issues.apache.org/jira/browse/SPARK-35663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Extend Catalyst's type system with a new type that conforms to the SQL standard 
> (see SQL:2016, section 4.6.2):
> * TimestampNTZType represents the timestamp without time zone type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35663) Add Timestamp without time zone type

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358415#comment-17358415
 ] 

Apache Spark commented on SPARK-35663:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/32802

> Add Timestamp without time zone type
> 
>
> Key: SPARK-35663
> URL: https://issues.apache.org/jira/browse/SPARK-35663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Extend Catalyst's type system with a new type that conforms to the SQL standard 
> (see SQL:2016, section 4.6.2):
> * TimestampNTZType represents the timestamp without time zone type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35663) Add Timestamp without time zone type

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35663:


Assignee: Apache Spark  (was: Gengliang Wang)

> Add Timestamp without time zone type
> 
>
> Key: SPARK-35663
> URL: https://issues.apache.org/jira/browse/SPARK-35663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Extend Catalyst's type system with a new type that conforms to the SQL standard 
> (see SQL:2016, section 4.6.2):
> * TimestampNTZType represents the timestamp without time zone type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35664) Support java.time.LocalDateTime as an external type of TimestampNTZ type

2021-06-07 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358416#comment-17358416
 ] 

Gengliang Wang commented on SPARK-35664:


I am working on this.

> Support java.time.LocalDateTime as an external type of TimestampNTZ type
> -
>
> Key: SPARK-35664
> URL: https://issues.apache.org/jira/browse/SPARK-35664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Allow parallelization/collection of java.time.LocalDateTime values, and 
> convert the values to timestamp values of TimestampNTZType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35665) resolve UnresolvedAlias in CollectMetrics

2021-06-07 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35665:
---

 Summary: resolve UnresolvedAlias in CollectMetrics
 Key: SPARK-35665
 URL: https://issues.apache.org/jira/browse/SPARK-35665
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35665) resolve UnresolvedAlias in CollectMetrics

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358456#comment-17358456
 ] 

Apache Spark commented on SPARK-35665:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32803

> resolve UnresolvedAlias in CollectMetrics
> -
>
> Key: SPARK-35665
> URL: https://issues.apache.org/jira/browse/SPARK-35665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35665) resolve UnresolvedAlias in CollectMetrics

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35665:


Assignee: Apache Spark

> resolve UnresolvedAlias in CollectMetrics
> -
>
> Key: SPARK-35665
> URL: https://issues.apache.org/jira/browse/SPARK-35665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35665) resolve UnresolvedAlias in CollectMetrics

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35665:


Assignee: (was: Apache Spark)

> resolve UnresolvedAlias in CollectMetrics
> -
>
> Key: SPARK-35665
> URL: https://issues.apache.org/jira/browse/SPARK-35665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35507) Move Python 3.9 installation to the docker image for GitHub Actions

2021-06-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35507:
-

Assignee: Dongjoon Hyun  (was: Apache Spark)

> Move Python 3.9 installation to the docker image for GitHub Actions
> 
>
> Key: SPARK-35507
> URL: https://issues.apache.org/jira/browse/SPARK-35507
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> SPARK-35506 added Python 3.9 support, but the workflow had to install it manually.
> The installed packages and Python versions should go into the docker image.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26867) Spark Support of YARN Placement Constraint

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358461#comment-17358461
 ] 

Apache Spark commented on SPARK-26867:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32804

> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint features, where an application can request 
> containers based on affinity / anti-affinity / cardinality to services, other 
> application containers, or node attributes. This would be a useful feature for 
> Spark jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26867) Spark Support of YARN Placement Constraint

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26867:


Assignee: Apache Spark

> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Assignee: Apache Spark
>Priority: Major
>
> YARN provides Placement Constraint features, where an application can request 
> containers based on affinity / anti-affinity / cardinality to services, other 
> application containers, or node attributes. This would be a useful feature for 
> Spark jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26867) Spark Support of YARN Placement Constraint

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26867:


Assignee: (was: Apache Spark)

> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint features, where an application can request 
> containers based on affinity / anti-affinity / cardinality to services, other 
> application containers, or node attributes. This would be a useful feature for 
> Spark jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26867) Spark Support of YARN Placement Constraint

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358463#comment-17358463
 ] 

Apache Spark commented on SPARK-26867:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32804

> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint features, where an application can request 
> containers based on affinity / anti-affinity / cardinality to services, other 
> application containers, or node attributes. This would be a useful feature for 
> Spark jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35591) Rename "Koalas" to "pandas API on Spark" in the documents

2021-06-07 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358538#comment-17358538
 ] 

Haejoon Lee commented on SPARK-35591:
-

[~pingsutw] Thanks for the concern, but I'm almost done with this work. Let me 
finish this! :)

> Rename "Koalas" to "pandas API on Spark" in the documents
> -
>
> Key: SPARK-35591
> URL: https://issues.apache.org/jira/browse/SPARK-35591
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should fix "Koalas" to "pandas on Spark" after initial porting of 
> documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35666) add new gemv to skip array shape checking

2021-06-07 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-35666:


 Summary: add new gemv to skip array shape checking
 Key: SPARK-35666
 URL: https://issues.apache.org/jira/browse/SPARK-35666
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng


In existing impls, it is a common case that the vector/matrix needs to be 
sliced/copied just to make the shapes match.

We can enhance the gemv function to avoid this.
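
For context, here is a rough sketch of the idea (an assumed helper, not the 
actual MLlib change): a gemv variant that takes explicit shape/offset arguments 
so that y := beta * y + alpha * A * x can be computed against a column-major 
sub-block of `a` without slicing or copying it first.
{code:scala}
// Column-major layout, as in MLlib's DenseMatrix: A(i, j) = a(aOffset + j * numRows + i)
def gemv(alpha: Double, a: Array[Double], aOffset: Int, numRows: Int, numCols: Int,
         x: Array[Double], beta: Double, y: Array[Double]): Unit = {
  var i = 0
  while (i < numRows) { y(i) *= beta; i += 1 }  // y := beta * y
  var j = 0
  while (j < numCols) {
    val xj = alpha * x(j)
    var k = 0
    while (k < numRows) {
      y(k) += a(aOffset + j * numRows + k) * xj
      k += 1
    }
    j += 1
  }
}
{code}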



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35665) resolve UnresolvedAlias in CollectMetrics

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35665:


Assignee: Wenchen Fan

> resolve UnresolvedAlias in CollectMetrics
> -
>
> Key: SPARK-35665
> URL: https://issues.apache.org/jira/browse/SPARK-35665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35665) resolve UnresolvedAlias in CollectMetrics

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35665.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32803
[https://github.com/apache/spark/pull/32803]

> resolve UnresolvedAlias in CollectMetrics
> -
>
> Key: SPARK-35665
> URL: https://issues.apache.org/jira/browse/SPARK-35665
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35666) add new gemv to skip array shape checking

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358557#comment-17358557
 ] 

Apache Spark commented on SPARK-35666:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32805

> add new gemv to skip array shape checking
> -
>
> Key: SPARK-35666
> URL: https://issues.apache.org/jira/browse/SPARK-35666
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> In existing impls, it is a common case that the vector/matrix needs to be 
> sliced/copied just to make the shapes match.
> We can enhance the gemv function to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35666) add new gemv to skip array shape checking

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35666:


Assignee: Apache Spark

> add new gemv to skip array shape checking
> -
>
> Key: SPARK-35666
> URL: https://issues.apache.org/jira/browse/SPARK-35666
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> In existing impls, it is a common case that the vector/matrix needs to be 
> sliced/copied just to make the shapes match.
> We can enhance the gemv function to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35666) add new gemv to skip array shape checking

2021-06-07 Thread Apache Spark (Jira)


[jira] [Commented] (SPARK-35666) add new gemv to skip array shape checking

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358558#comment-17358558
 ] 

Apache Spark commented on SPARK-35666:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32805

> add new gemv to skip array shape checking
> -
>
> Key: SPARK-35666
> URL: https://issues.apache.org/jira/browse/SPARK-35666
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> In existing impls, it is a common case that the vector/matrix needs to be 
> sliced/copied just to make the shapes match.
> We can enhance the gemv function to avoid this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)
yuanxm created SPARK-35667:
--

 Summary: spark.speculation causes incorrect query results with 
TRANSFORM
 Key: SPARK-35667
 URL: https://issues.apache.org/jira/browse/SPARK-35667
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.8
 Environment: {code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}
 
Reporter: yuanxm


The following SQL sometimes gets incorrect results when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the result of count is less than the correct one. 
It's more likely to get incorrect answer when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanxm updated SPARK-35667:
---
Environment: (was: {code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}
 )

> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
>
> SQL as follow gets incorrect results sometimes when spark.speculation is 
> true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the result of count is less than the correct 
> one. It's more likely to get incorrect answer when there is more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanxm updated SPARK-35667:
---
Description: 
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the result of count is less than the correct one. 
It's more likely to get incorrect answer when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}

  was:
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the result of count is less than the correct one. 
It's more likely to get incorrect answer when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 


> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
>
> SQL as follow gets incorrect results sometimes when spark.speculation is 
> true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the result of count is less than the correct 
> one. It's more likely to get incorrect answer when there is more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanxm updated SPARK-35667:
---
Description: 
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the count result is less than the correct one. 
It's more likely to get incorrect answer when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}

  was:
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the result of count is less than the correct one. 
It's more likely to get incorrect answer when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}


> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
>
> SQL as follow gets incorrect results sometimes when spark.speculation is 
> true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the count result is less than the correct one. 
> It's more likely to get incorrect answer when there is more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-35668:
---

 Summary: Use "concurrency" setting on Github Actions
 Key: SPARK-35668
 URL: https://issues.apache.org/jira/browse/SPARK-35668
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.1.2
Reporter: Yikun Jiang


We are using the 
[cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
 job to cancel previous runs when a new run is queued. GitHub Actions now 
supports this natively via the 
["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
 syntax, which makes sure only a single job or workflow in the same concurrency 
group runs at a time.

related: https://github.com/apache/arrow/pull/10416 and 
https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35668:


Assignee: Apache Spark

> Use "concurrency" setting on Github Actions
> ---
>
> Key: SPARK-35668
> URL: https://issues.apache.org/jira/browse/SPARK-35668
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> We are using 
> [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
>  job to cancel previous jobs when a new job is queued. Now, it has been 
> supported by the github action by using 
> ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
>  syntax to make sure only a single job or workflow using the same concurrency 
> group.
> related: https://github.com/apache/arrow/pull/10416 and 
> https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358596#comment-17358596
 ] 

Apache Spark commented on SPARK-35668:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32806

> Use "concurrency" setting on Github Actions
> ---
>
> Key: SPARK-35668
> URL: https://issues.apache.org/jira/browse/SPARK-35668
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Major
>
> We are using 
> [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
>  job to cancel previous jobs when a new job is queued. Now, it has been 
> supported by the github action by using 
> ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
>  syntax to make sure only a single job or workflow using the same concurrency 
> group.
> related: https://github.com/apache/arrow/pull/10416 and 
> https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35668:


Assignee: (was: Apache Spark)

> Use "concurrency" setting on Github Actions
> ---
>
> Key: SPARK-35668
> URL: https://issues.apache.org/jira/browse/SPARK-35668
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Major
>
> We are using 
> [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
>  job to cancel previous jobs when a new job is queued. Now, it has been 
> supported by the github action by using 
> ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
>  syntax to make sure only a single job or workflow using the same concurrency 
> group.
> related: https://github.com/apache/arrow/pull/10416 and 
> https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358597#comment-17358597
 ] 

Apache Spark commented on SPARK-35668:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32806

> Use "concurrency" setting on Github Actions
> ---
>
> Key: SPARK-35668
> URL: https://issues.apache.org/jira/browse/SPARK-35668
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Priority: Major
>
> We are using 
> [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
>  job to cancel previous jobs when a new job is queued. Now, it has been 
> supported by the github action by using 
> ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
>  syntax to make sure only a single job or workflow using the same concurrency 
> group.
> related: https://github.com/apache/arrow/pull/10416 and 
> https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35543) Small memory leak in BlockManagerMasterEndpoint

2021-06-07 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-35543.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32790
[https://github.com/apache/spark/pull/32790]

> Small memory leak in BlockManagerMasterEndpoint 
> 
>
> Key: SPARK-35543
> URL: https://issues.apache.org/jira/browse/SPARK-35543
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 3.2.0
>
>
> It is regarding _blockStatusByShuffleService_: when all the blocks are removed 
> for a bmId, the map entry can be cleaned up too.
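
A minimal sketch of the idea (simplified, assumed types; not the merged patch): 
drop the outer map entry as soon as its inner block map becomes empty, so the 
endpoint does not keep accumulating empty maps.
{code:scala}
import scala.collection.mutable

val blockStatusByShuffleService = mutable.HashMap[String, mutable.HashMap[String, Long]]()

def removeBlock(bmId: String, blockId: String): Unit = {
  blockStatusByShuffleService.get(bmId).foreach { blocks =>
    blocks.remove(blockId)
    if (blocks.isEmpty) blockStatusByShuffleService.remove(bmId)  // the cleanup in question
  }
}
{code}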



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35669) fix special char in CSV header with filter pushdown

2021-06-07 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35669:
---

 Summary: fix special char in CSV header with filter pushdown
 Key: SPARK-35669
 URL: https://issues.apache.org/jira/browse/SPARK-35669
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanxm updated SPARK-35667:
---
Description: 
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the count result is less than the correct one. 
It's more likely to get incorrect result when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}

  was:
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the count result is less than the correct one. 
It's more likely to get incorrect answer when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}


> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
>
> SQL as follow gets incorrect results sometimes when spark.speculation is 
> true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the count result is less than the correct one. 
> It's more likely to get incorrect result when there is more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanxm updated SPARK-35667:
---
Description: 
The following SQL sometimes gets incorrect results when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the count result is less than the correct one. 
It's more likely to get incorrect result when there are more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}

  was:
SQL as follow gets incorrect results sometimes when spark.speculation is true: 
{code:java}
SELECT count(1)
FROM
  (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
   FROM
 (SELECT dt
  FROM test_table)tmpa1)tmpa2{code}
With spark.speculation=true, the count result is less than the correct one. 
It's more likely to get incorrect result when there is more speculative tasks. 

`test.py`:
{code:java}
import sys
for line in sys.stdin:
line = line.strip()
arr = line.split()
print "\t".join(arr){code}
 

spark-sql command:
{code:java}
./bin/spark-sql --master yarn \ 
--conf spark.speculation=true \ 
--conf spark.shuffle.service.enabled=true \ 
--conf spark.dynamicAllocation.enabled=true \ 
--conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
--conf spark.dynamicAllocation.initialExecutor=1 \ 
--conf spark.dynamicAllocation.maxExecutors=40
{code}


> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
>
> SQL as follow gets incorrect results sometimes when spark.speculation is 
> true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the count result is less than the correct one. 
> It's more likely to get incorrect result when there are more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35669) fix special char in CSV header with filter pushdown

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35669:


Assignee: (was: Apache Spark)

> fix special char in CSV header with filter pushdown
> ---
>
> Key: SPARK-35669
> URL: https://issues.apache.org/jira/browse/SPARK-35669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35669) fix special char in CSV header with filter pushdown

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358626#comment-17358626
 ] 

Apache Spark commented on SPARK-35669:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32807

> fix special char in CSV header with filter pushdown
> ---
>
> Key: SPARK-35669
> URL: https://issues.apache.org/jira/browse/SPARK-35669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35669) fix special char in CSV header with filter pushdown

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35669:


Assignee: Apache Spark

> fix special char in CSV header with filter pushdown
> ---
>
> Key: SPARK-35669
> URL: https://issues.apache.org/jira/browse/SPARK-35669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35663) Add Timestamp without time zone type

2021-06-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35663.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32802
[https://github.com/apache/spark/pull/32802]

> Add Timestamp without time zone type
> 
>
> Key: SPARK-35663
> URL: https://issues.apache.org/jira/browse/SPARK-35663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Extend Catalyst's type system by a new type that conforms to the SQL standard 
> (see SQL:2016, section 4.6.2):
> * TimestampNTZType represents the timestamp without time zone type
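
Not in the original description: a minimal sketch of how the new type could 
surface in a schema, assuming it is exposed as 
org.apache.spark.sql.types.TimestampNTZType (the name this sub-task uses) with a 
simple name of timestamp_ntz; both names are assumptions, not a confirmed public 
API at the time of this thread.
{code:java}
// Illustrative only: the type name and simpleString output are assumptions.
import org.apache.spark.sql.types.{StructField, StructType, TimestampNTZType}

val schema = StructType(Seq(StructField("event_time", TimestampNTZType)))
println(schema.simpleString) // expected: struct<event_time:timestamp_ntz>
{code}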



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35074) spark.jars.xxx configs should be moved to config/package.scala

2021-06-07 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-35074.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

> spark.jars.xxx configs should be moved to config/package.scala
> --
>
> Key: SPARK-35074
> URL: https://issues.apache.org/jira/browse/SPARK-35074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Currently {{spark.jars.xxx}} property keys (e.g. {{spark.jars.ivySettings}} 
> and {{spark.jars.packages}}) are hardcoded in multiple places within Spark 
> code across multiple modules. We should define them in 
> {{config/package.scala}} and reference them in all other places.
> This came up during reviews of SPARK-34472 at 
> https://github.com/apache/spark/pull/31591#discussion_r584848624



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35598) Improve Spark-ML PCA analysis

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35598:


Assignee: Apache Spark

> Improve Spark-ML PCA analysis
> -
>
> Key: SPARK-35598
> URL: https://issues.apache.org/jira/browse/SPARK-35598
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.1.2
>Reporter: Antonio Zammuto
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> When performing a PCA, the covariance matrix is calculated first.
>  In the function RowMatrix.computeCovariance() there is the following code:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> As can be seen, there are two different functions to compute the covariance 
> matrix, depending on whether the matrix is sparse or dense. Currently, the 
> decision whether the matrix is sparse or dense is based only on the type of 
> the first row (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix as the ratio between the 
> number of zeros and the total number of elements, and to use that sparsity to 
> switch between the two functions.
> *Q2. What problem is this proposal NOT designed to solve?*
> The 2 functions give slightly different results. This SPIP doesn't solve this 
> issue
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing the decision only on the type of the first row is questionable.
> It would be better to base it on the entire matrix, and on a mathematical 
> concept such as sparsity.
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way: the decision is based on the 
> entire matrix.
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis; it will benefit anyone using the PCA 
> analysis.
> *Q6. What are the risks?*
> You still need a threshold to decide whether the matrix is sparse or not. 
> There is no universally defined value for sparsity.
>  Therefore the threshold of 50% that I chose is arbitrary, and might not be 
> the best in all cases.
>  Still, I consider it an improvement compared to the current implementation.
> *Q7. How long will it take?*
> The change is easy to implement and test.
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there are no 
> "unexplainable" differences from the current implementation.
>  "Unexplainable" meaning anything beyond the fact that there will be cases 
> where it now uses computeDenseVectorCovariance whereas before it used 
> computeSparseVectorCovariance, and vice versa.
> The tests in the RowMatrixSuite are successful.
>  A couple of tests have been added to verify the sparsity is calculated 
> properly.
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if 
> any. Backward and forward compatibility must be taken into account.*
> defining a function to calculate the sparsity of the RowMatrix
> {code:java}
>   def calcSparsity(): Double = {
> rows.map{ vec => (vec.size-vec.numNonzeros) }.sum/
>   (rows.count() * rows.take(1)(0).size).toDouble
>   }
> {code}
> Changing the switching condition. Before was:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> Proposed new code is:
> {code:java}
> val sparsityThreshold = 0.5
> val sparsity = calcSparsity()
> if (sparsity < sparsityThreshold) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
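
Not part of the SPIP text: a small spark-shell sketch, with assumed toy data, 
showing what the proposed zeros-over-total-entries ratio evaluates to on a tiny 
RowMatrix and which covariance path a 0.5 threshold would select.
{code:java}
// Illustrative only; uses the same formula as the proposed calcSparsity().
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy 3 x 4 matrix with 8 zeros out of 12 entries.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 0.0, 2.0),
  Vectors.sparse(4, Seq((1, 3.0))),
  Vectors.dense(0.0, 0.0, 4.0, 0.0)))
val mat = new RowMatrix(rows)

val sparsity = rows.map(v => (v.size - v.numNonzeros).toDouble).sum() /
  (mat.numRows() * mat.numCols())
// sparsity = 8.0 / 12 = 0.67, so with a 0.5 threshold this matrix would take
// the computeSparseVectorCovariance path.
{code}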



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35598) Improve Spark-ML PCA analysis

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358659#comment-17358659
 ] 

Apache Spark commented on SPARK-35598:
--

User 'Antozamm' has created a pull request for this issue:
https://github.com/apache/spark/pull/32808

> Improve Spark-ML PCA analysis
> -
>
> Key: SPARK-35598
> URL: https://issues.apache.org/jira/browse/SPARK-35598
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.1.2
>Reporter: Antonio Zammuto
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> When performing a PCA, the covariance matrix is calculated first.
>  In the function RowMatrix.computeCovariance() there is the following code:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> As can be seen, there are two different functions to compute the covariance 
> matrix, depending on whether the matrix is sparse or dense. Currently, the 
> decision whether the matrix is sparse or dense is based only on the type of 
> the first row (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix as the ratio between the 
> number of zeros and the total number of elements, and to use that sparsity to 
> switch between the two functions.
> *Q2. What problem is this proposal NOT designed to solve?*
> The 2 functions give slightly different results. This SPIP doesn't solve this 
> issue
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing the decision only on the type of the first row is questionable.
> It would be better to base it on the entire matrix, and on a mathematical 
> concept such as sparsity.
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way: the decision is based on the 
> entire matrix.
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis; it will benefit anyone using the PCA 
> analysis.
> *Q6. What are the risks?*
> You still need a threshold to decide whether the matrix is sparse or not. 
> There is no universally defined value for sparsity.
>  Therefore the threshold of 50% that I chose is arbitrary, and might not be 
> the best in all cases.
>  Still, I consider it an improvement compared to the current implementation.
> *Q7. How long will it take?*
> The change is easy to implement and test.
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there are no 
> "unexplainable" differences from the current implementation.
>  "Unexplainable" meaning anything beyond the fact that there will be cases 
> where it now uses computeDenseVectorCovariance whereas before it used 
> computeSparseVectorCovariance, and vice versa.
> The tests in the RowMatrixSuite are successful.
>  A couple of tests have been added to verify the sparsity is calculated 
> properly.
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if 
> any. Backward and forward compatibility must be taken into account.*
> defining a function to calculate the sparsity of the RowMatrix
> {code:java}
>   def calcSparsity(): Double = {
> rows.map{ vec => (vec.size-vec.numNonzeros) }.sum/
>   (rows.count() * rows.take(1)(0).size).toDouble
>   }
> {code}
> Changing the switching condition. Before was:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> Proposed new code is:
> {code:java}
> val sparsityThreshold = 0.5
> val sparsity = calcSparsity()
> if (sparsity < sparsityThreshold) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35598) Improve Spark-ML PCA analysis

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35598:


Assignee: (was: Apache Spark)

> Improve Spark-ML PCA analysis
> -
>
> Key: SPARK-35598
> URL: https://issues.apache.org/jira/browse/SPARK-35598
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.1.2
>Reporter: Antonio Zammuto
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> When performing a PCA, the covariance matrix is calculated first.
>  In the function RowMatrix.computeCovariance() there is the following code:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> As can be seen, there are two different functions to compute the covariance 
> matrix, depending on whether the matrix is sparse or dense. Currently, the 
> decision whether the matrix is sparse or dense is based only on the type of 
> the first row (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix as the ratio between the 
> number of zeros and the total number of elements, and to use that sparsity to 
> switch between the two functions.
> *Q2. What problem is this proposal NOT designed to solve?*
> The 2 functions give slightly different results. This SPIP doesn't solve this 
> issue
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing the decision only on the type of the first row is questionable.
> It would be better to base it on the entire matrix, and on a mathematical 
> concept such as sparsity.
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way: the decision is based on the 
> entire matrix.
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis; it will benefit anyone using the PCA 
> analysis.
> *Q6. What are the risks?*
> You still need a threshold to decide whether the matrix is sparse or not. 
> There is no universally defined value for sparsity.
>  Therefore the threshold of 50% that I chose is arbitrary, and might not be 
> the best in all cases.
>  Still, I consider it an improvement compared to the current implementation.
> *Q7. How long will it take?*
> The change is easy to implement and test.
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there are no 
> "unexplainable" differences from the current implementation.
>  "Unexplainable" meaning anything beyond the fact that there will be cases 
> where it now uses computeDenseVectorCovariance whereas before it used 
> computeSparseVectorCovariance, and vice versa.
> The tests in the RowMatrixSuite are successful.
>  A couple of tests have been added to verify the sparsity is calculated 
> properly.
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if 
> any. Backward and forward compatibility must be taken into account.*
> defining a function to calculate the sparsity of the RowMatrix
> {code:java}
>   def calcSparsity(): Double = {
> rows.map{ vec => (vec.size-vec.numNonzeros) }.sum/
>   (rows.count() * rows.take(1)(0).size).toDouble
>   }
> {code}
> Changing the switching condition. Before was:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> Proposed new code is:
> {code:java}
> val sparsityThreshold = 0.5
> val sparsity = calcSparsity()
> if (sparsity < sparsityThreshold) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35598) Improve Spark-ML PCA analysis

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358660#comment-17358660
 ] 

Apache Spark commented on SPARK-35598:
--

User 'Antozamm' has created a pull request for this issue:
https://github.com/apache/spark/pull/32808

> Improve Spark-ML PCA analysis
> -
>
> Key: SPARK-35598
> URL: https://issues.apache.org/jira/browse/SPARK-35598
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.1.2
>Reporter: Antonio Zammuto
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> When performing a PCA, the covariance matrix is calculated first.
>  In the function RowMatrix.computeCovariance() there is the following code:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> As can be seen, there are two different functions to compute the covariance 
> matrix, depending on whether the matrix is sparse or dense. Currently, the 
> decision whether the matrix is sparse or dense is based only on the type of 
> the first row (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix as the ratio between the 
> number of zeros and the total number of elements, and to use that sparsity to 
> switch between the two functions.
> *Q2. What problem is this proposal NOT designed to solve?*
> The 2 functions give slightly different results. This SPIP doesn't solve this 
> issue
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing the decision only on the type of the first row is questionable.
> It would be better to base it on the entire matrix, and on a mathematical 
> concept such as sparsity.
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way: the decision is based on the 
> entire matrix.
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis; it will benefit anyone using the PCA 
> analysis.
> *Q6. What are the risks?*
> You still need a threshold to decide whether the matrix is sparse or not. 
> There is no universally defined value for sparsity.
>  Therefore the threshold of 50% that I chose is arbitrary, and might not be 
> the best in all cases.
>  Still, I consider it an improvement compared to the current implementation.
> *Q7. How long will it take?*
> The change is easy to implement and test.
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there are no 
> "unexplainable" differences from the current implementation.
>  "Unexplainable" meaning anything beyond the fact that there will be cases 
> where it now uses computeDenseVectorCovariance whereas before it used 
> computeSparseVectorCovariance, and vice versa.
> The tests in the RowMatrixSuite are successful.
>  A couple of tests have been added to verify the sparsity is calculated 
> properly.
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if 
> any. Backward and forward compatibility must be taken into account.*
> defining a function to calculate the sparsity of the RowMatrix
> {code:java}
>   def calcSparsity(): Double = {
> rows.map{ vec => (vec.size-vec.numNonzeros) }.sum/
>   (rows.count() * rows.take(1)(0).size).toDouble
>   }
> {code}
> Changing the switching condition. Before was:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> Proposed new code is:
> {code:java}
> val sparsityThreshold = 0.5
> val sparsity = calcSparsity()
> if (sparsity < sparsityThreshold) {
>   computeDenseVectorCovariance(mean, n, m)
> } else {
>   computeSparseVectorCovariance(mean, n, m)
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358707#comment-17358707
 ] 

Erik Krogen commented on SPARK-35667:
-

fyi [~vsowrirajan] [~ron8hu]

> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
>
> SQL as follow gets incorrect results sometimes when spark.speculation is 
> true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the count result is less than the correct one. 
> It's more likely to get incorrect result when there are more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35662) Support Timestamp without time zone data type

2021-06-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35662:
---
Description: 
Spark SQL today supports the TIMESTAMP data type. However the semantics 
provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
session local timezone and cast to UTC before being processed.
 These are desirable semantics in many cases, such as when dealing with 
calendars.
 In many (more) other cases, such as when dealing with log files it is 
desirable that the provided timestamps not be altered.
 SQL users expect that they can model either behavior and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
 Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will 
be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not 
exist in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
standard semantic.
 Using these two types will provide clarity.
 We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type 
TimestampWithoutTZ meets or exceeds all function of the existing SQL Timestamp):
 * Add a new DataType implementation for TimestampWithoutTZ.
 * Support TimestampWithoutTZ in Dataset/UDF.
 * TimestampWithoutTZ literals
 * TimestampWithoutTZ arithmetic(e.g. TimestampWithoutTZ - TimestampWithoutTZ, 
TimestampWithoutTZ - Date)
 * Datetime functions/operators: dayofweek, weekofyear, year, etc
 * Cast to and from TimestampWithoutTZ, cast String/Timestamp to 
TimestampWithoutTZ, cast TimestampWithoutTZ to string (pretty 
printing)/Timestamp, with the SQL syntax to specify the types
 * Support sorting TimestampWithoutTZ.

h3. Milestone 2 – Persistence:
 * Ability to create tables of type TimestampWithoutTZ
 * Ability to write to common file formats such as Parquet and JSON.
 * INSERT, SELECT, UPDATE, MERGE
 * Discovery

h3. Milestone 3 – Client support
 * JDBC support
 * Hive Thrift server

h3. Milestone 4 – PySpark and Spark R integration
 * Python UDF can take and return TimestampWithoutTZ
 * DataFrame support
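
To make the Milestone 1 items concrete, here is an illustrative sketch only: the 
SQL keyword and the type name were still in flux at this point (this very thread 
moves between TimestampNTZ and TimestampWithoutTZ), so the exact spelling below 
is an assumption rather than the final syntax.
{code:java}
// Assumed syntax, for illustration of the intended semantics only.
val df = spark.sql(
  "SELECT CAST('2021-06-07 12:00:00' AS TIMESTAMP WITHOUT TIME ZONE) AS ts")
df.printSchema()
// Unlike the existing TIMESTAMP (WITH LOCAL TIME ZONE), changing the session
// time zone should not change the value of `ts`.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}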

  was:
Spark SQL today supports the TIMESTAMP data type. However the semantics 
provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
Timestamps embedded in a SQL query or passed through JDBC are presumed to be in 
session local timezone and cast to UTC before being processed.
 These are desirable semantics in many cases, such as when dealing with 
calendars.
 In many (more) other cases, such as when dealing with log files it is 
desirable that the provided timestamps not be altered.
 SQL users expect that they can model either behavior and do so by using 
TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
LOCAL TIME ZONE for time zone sensitive data.
 Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will 
be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not 
exist in the standard.

In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
standard semantic.
 Using these two types will provide clarity.
 We will also allow users to set the default behavior for TIMESTAMP to either 
use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type 
TimestampNTZ meets or exceeds all function of the existing SQL Timestamp):
 * Add a new DataType implementation for TimestampNTZ.
 * Support TimestampNTZ in Dataset/UDF.
 * TimestampNTZ literals
 * TimestampNTZ arithmetic(e.g. TimestampNTZ - TimestampNTZ, TimestampNTZ - 
Date)
 * Datetime functions/operators: dayofweek, weekofyear, year, etc
 * Cast to and from TimestampNTZ, cast String/Timestamp to TimestampNTZ, cast 
TimestampNTZ to string (pretty printing)/Timestamp, with the SQL syntax to 
specify the types
 * Support sorting TimestampNTZ.

h3. Milestone 2 – Persistence:
 * Ability to create tables of type TimestampNTZ
 * Ability to write to common file formats such as Parquet and JSON.
 * INSERT, SELECT, UPDATE, MERGE
 * Discovery

h3. Milestone 3 – Client support
 * JDBC support
 * Hive Thrift server

h3. Milestone 4 – PySpark and Spark R integration
 * Python UDF can take and return TimestampNTZ
 * DataFrame support


> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache

[jira] [Updated] (SPARK-35663) Add Timestamp without time zone type

2021-06-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35663:
---
Description: 
Extend Catalyst's type system by a new type that conforms to the SQL standard 
(see SQL:2016, section 4.6.2):

* TimestampWithoutTZ represents the timestamp without time zone type

  was:
Extend Catalyst's type system by a new type that conforms to the SQL standard 
(see SQL:2016, section 4.6.2):

* TimestampNTZType represents the timestamp without time zone type


> Add Timestamp without time zone type
> 
>
> Key: SPARK-35663
> URL: https://issues.apache.org/jira/browse/SPARK-35663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Extend Catalyst's type system by a new type that conforms to the SQL standard 
> (see SQL:2016, section 4.6.2):
> * TimestampWithoutTZ represents the timestamp without time zone type



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35664) Support java.time. LocalDateTime as an external type of TimestampWithoutTZ type

2021-06-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35664:
---
Summary: Support java.time. LocalDateTime as an external type of 
TimestampWithoutTZ type  (was: Support java.time. LocalDateTime as an external 
type of TimestampNTZ type)

> Support java.time. LocalDateTime as an external type of TimestampWithoutTZ 
> type
> ---
>
> Key: SPARK-35664
> URL: https://issues.apache.org/jira/browse/SPARK-35664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Allow parallelization/collection of java.time.LocalDateTime values, and 
> convert the values to timestamp values of TimestampNTZType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35664) Support java.time. LocalDateTime as an external type of TimestampWithoutTZ type

2021-06-07 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35664:
---
Description: Allow parallelization/collection of java.time.LocalDateTime 
values, and convert the values to timestamp values of TimestampWithoutTZType.  
(was: Allow parallelization/collection of java.time.LocalDateTime values, and 
convert the values to timestamp values of TimestampNTZType.)
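
Not part of the ticket: a sketch of the intended round trip, assuming the 
java.time.LocalDateTime encoder that this sub-task introduces is available 
through spark.implicits._ (it was not yet released at the time of this thread).
{code:java}
// Illustrative only; relies on the LocalDateTime encoder added by this sub-task.
import java.time.LocalDateTime
import spark.implicits._

val ds = Seq(LocalDateTime.parse("2021-06-07T12:00:00")).toDS()
ds.printSchema() // expected: a timestamp column of the new time-zone-free type
ds.collect()     // expected: Array(2021-06-07T12:00), unaffected by session TZ
{code}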

> Support java.time. LocalDateTime as an external type of TimestampWithoutTZ 
> type
> ---
>
> Key: SPARK-35664
> URL: https://issues.apache.org/jira/browse/SPARK-35664
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Allow parallelization/collection of java.time.LocalDateTime values, and 
> convert the values to timestamp values of TimestampWithoutTZType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35074) spark.jars.xxx configs should be moved to config/package.scala

2021-06-07 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-35074:
-

Assignee: dc-heros

> spark.jars.xxx configs should be moved to config/package.scala
> --
>
> Key: SPARK-35074
> URL: https://issues.apache.org/jira/browse/SPARK-35074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: dc-heros
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Currently {{spark.jars.xxx}} property keys (e.g. {{spark.jars.ivySettings}} 
> and {{spark.jars.packages}}) are hardcoded in multiple places within Spark 
> code across multiple modules. We should define them in 
> {{config/package.scala}} and reference them in all other places.
> This came up during reviews of SPARK-34472 at 
> https://github.com/apache/spark/pull/31591#discussion_r584848624



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32922) Add support for ShuffleBlockFetcherIterator to read from merged shuffle partitions and to fallback to original shuffle blocks if encountering failures

2021-06-07 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358768#comment-17358768
 ] 

Chandni Singh commented on SPARK-32922:
---

Splitting the changes into ESS server side/client side changes as per the 
comment here: https://github.com/apache/spark/pull/32140#issuecomment-856099709

> Add support for ShuffleBlockFetcherIterator to read from merged shuffle 
> partitions and to fallback to original shuffle blocks if encountering failures
> --
>
> Key: SPARK-32922
> URL: https://issues.apache.org/jira/browse/SPARK-32922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
>
> With the extended MapOutputTracker, the reducers can now get the task input 
> data from the merged shuffle partitions for more efficient shuffle data 
> fetch. The reducers should also be able to fallback to fetching the original 
> unmarked blocks if it encounters failures when fetching the merged shuffle 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35670) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-07 Thread David Christle (Jira)
David Christle created SPARK-35670:
--

 Summary: Upgrade ZSTD-JNI to 1.5.0-1
 Key: SPARK-35670
 URL: https://issues.apache.org/jira/browse/SPARK-35670
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.2.0
Reporter: David Christle






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-07 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-35671:
-

 Summary: Add Support in the ESS to serve merged shuffle block meta 
and data to executors
 Key: SPARK-35671
 URL: https://issues.apache.org/jira/browse/SPARK-35671
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Affects Versions: 3.1.0
Reporter: Chandni Singh


With push-based shuffle enabled, the reducers send 2 different requests to the 
ESS:

1. Request to fetch the metadata of the merged shuffle block.
2. Requests to fetch the data of the merged shuffle block in chunks which are 
by default 2MB in size.
The ESS should be able to serve these requests and this Jira targets all the 
changes in the ESS to be able to support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35670) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35670:


Assignee: (was: Apache Spark)

> Upgrade ZSTD-JNI to 1.5.0-1
> ---
>
> Key: SPARK-35670
> URL: https://issues.apache.org/jira/browse/SPARK-35670
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: David Christle
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35670) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35670:


Assignee: Apache Spark

> Upgrade ZSTD-JNI to 1.5.0-1
> ---
>
> Key: SPARK-35670
> URL: https://issues.apache.org/jira/browse/SPARK-35670
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: David Christle
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35670) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358786#comment-17358786
 ] 

Apache Spark commented on SPARK-35670:
--

User 'dchristle' has created a pull request for this issue:
https://github.com/apache/spark/pull/32809

> Upgrade ZSTD-JNI to 1.5.0-1
> ---
>
> Key: SPARK-35670
> URL: https://issues.apache.org/jira/browse/SPARK-35670
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: David Christle
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35670) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358788#comment-17358788
 ] 

Apache Spark commented on SPARK-35670:
--

User 'dchristle' has created a pull request for this issue:
https://github.com/apache/spark/pull/32809

> Upgrade ZSTD-JNI to 1.5.0-1
> ---
>
> Key: SPARK-35670
> URL: https://issues.apache.org/jira/browse/SPARK-35670
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: David Christle
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-07 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-35671:
--
Description: 
With push-based shuffle enabled, the reducers send 2 different requests to the 
ESS:

1. Request to fetch the metadata of the merged shuffle block.
 2. Requests to fetch the data of the merged shuffle block in chunks which are 
by default 2MB in size.
 The ESS should be able to serve these requests and this Jira targets all the 
changes in the network-common and network-shuffle modules to be able to support 
this.

  was:
With push-based shuffle enabled, the reducers send 2 different requests to the 
ESS:

1. Request to fetch the metadata of the merged shuffle block.
2. Requests to fetch the data of the merged shuffle block in chunks which are 
by default 2MB in size.
The ESS should be able to serve these requests and this Jira targets all the 
changes in the ESS to be able to support this.


> Add Support in the ESS to serve merged shuffle block meta and data to 
> executors
> ---
>
> Key: SPARK-35671
> URL: https://issues.apache.org/jira/browse/SPARK-35671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> With push-based shuffle enabled, the reducers send 2 different requests to 
> the ESS:
> 1. Request to fetch the metadata of the merged shuffle block.
>  2. Requests to fetch the data of the merged shuffle block in chunks which 
> are by default 2MB in size.
>  The ESS should be able to serve these requests and this Jira targets all the 
> changes in the network-common and network-shuffle modules to be able to 
> support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-06-07 Thread obfuscated_dvlper (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358839#comment-17358839
 ] 

obfuscated_dvlper commented on SPARK-34791:
---

Hi team, any updates on this? I tried with Spark 3.1.2 and am still getting 
"Error: node stack overflow". It looks like this 
[https://github.com/apache/spark/pull/26429#discussion_r346103050] was never 
addressed?

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before 
> invoking the Spark session and once within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-06-07 Thread Erik Krogen (Jira)
Erik Krogen created SPARK-35672:
---

 Summary: Spark fails to launch executors with very large user 
classpath lists on YARN
 Key: SPARK-35672
 URL: https://issues.apache.org/jira/browse/SPARK-35672
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 3.1.2
 Environment: Linux RHEL7
Spark 3.1.1
Reporter: Erik Krogen


When running Spark on YARN, the {{user-class-path}} argument to 
{{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
executor processes. The argument is specified once for each JAR, and the URIs 
are fully-qualified, so the paths can be quite long. With large user JAR lists 
(say 1000+), this can result in system-level argument length limits being 
exceeded, typically manifesting as the error message:
{code}
/bin/bash: Argument list too long
{code}

A [Google 
search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
 indicates that this is not a theoretical problem and afflicts real users, 
including ours. This issue was originally observed on Spark 2.3, but has been 
confirmed to exist in the master branch as well.
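
Not from the ticket: a rough, back-of-the-envelope estimate of how quickly the 
per-jar {{--user-class-path}} arguments add up; the jar count and URI length 
below are assumptions, and the actual limit is system dependent (compare against 
{{getconf ARG_MAX}} on the host).
{code:java}
// Illustrative arithmetic only; numbers are assumed, not measured.
val numJars = 1000        // a large user dependency list
val avgUriLength = 250    // fully qualified HDFS URIs can easily be this long
val perJarFlag = "--user-class-path ".length
val approxBytes = numJars.toLong * (avgUriLength + perJarFlag)
println(s"approx bytes added to the executor launch command: $approxBytes")
{code}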



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35343) Make the conversion from/to pandas data-type-based for non-ExtensionDtypes

2021-06-07 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35343.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 32592
https://github.com/apache/spark/pull/32592

> Make the conversion from/to pandas data-type-based for non-ExtensionDtypes
> --
>
> Key: SPARK-35343
> URL: https://issues.apache.org/jira/browse/SPARK-35343
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> The conversion from/to pandas includes logic for checking data types and 
> behaving accordingly.
> That makes code hard to change or maintain.
> Since we have introduced the Ops class per non-ExtensionDtypes data type, we 
> ought to make the conversion from/to pandas data-type-based.
> Ops class per ExtensionDtype and its data-type-based from/to pandas will be 
> implemented in a separate PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35672:


Assignee: (was: Apache Spark)

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358849#comment-17358849
 ] 

Apache Spark commented on SPARK-35672:
--

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/32810

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Priority: Major
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35672:


Assignee: Apache Spark

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Apache Spark
>Priority: Major
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358885#comment-17358885
 ] 

Apache Spark commented on SPARK-35671:
--

User 'otterc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32811

> Add Support in the ESS to serve merged shuffle block meta and data to 
> executors
> ---
>
> Key: SPARK-35671
> URL: https://issues.apache.org/jira/browse/SPARK-35671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> With push-based shuffle enabled, the reducers send 2 different requests to 
> the ESS:
> 1. Request to fetch the metadata of the merged shuffle block.
>  2. Requests to fetch the data of the merged shuffle block in chunks which 
> are by default 2MB in size.
>  The ESS should be able to serve these requests and this Jira targets all the 
> changes in the network-common and network-shuffle modules to be able to 
> support this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35671:


Assignee: Apache Spark

> Add Support in the ESS to serve merged shuffle block meta and data to 
> executors
> ---
>
> Key: SPARK-35671
> URL: https://issues.apache.org/jira/browse/SPARK-35671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Apache Spark
>Priority: Major
>
> With push-based shuffle enabled, the reducers send two different kinds of 
> requests to the ESS:
> 1. A request to fetch the metadata of the merged shuffle block.
>  2. Requests to fetch the data of the merged shuffle block in chunks, which 
> are 2 MB in size by default.
>  The ESS should be able to serve these requests; this Jira covers the changes 
> in the network-common and network-shuffle modules needed to support them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35671) Add Support in the ESS to serve merged shuffle block meta and data to executors

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35671:


Assignee: (was: Apache Spark)

> Add Support in the ESS to serve merged shuffle block meta and data to 
> executors
> ---
>
> Key: SPARK-35671
> URL: https://issues.apache.org/jira/browse/SPARK-35671
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Priority: Major
>
> With push-based shuffle enabled, the reducers send two different kinds of 
> requests to the ESS:
> 1. A request to fetch the metadata of the merged shuffle block.
>  2. Requests to fetch the data of the merged shuffle block in chunks, which 
> are 2 MB in size by default.
>  The ESS should be able to serve these requests; this Jira covers the changes 
> in the network-common and network-shuffle modules needed to support them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35341) Introduce BooleanExtensionOps

2021-06-07 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35341.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 32698
https://github.com/apache/spark/pull/32698

> Introduce BooleanExtensionOps
> -
>
> Key: SPARK-35341
> URL: https://issues.apache.org/jira/browse/SPARK-35341
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> Now {{__and__}}, {{__or__}}, {{__rand__}}, and {{__ror__}} are not data type 
> based.
> So we would like to introduce these operators for the Boolean Spark type IndexOps.
> These bitwise operators should also be applicable to other data type classes; 
> however, that is not the goal of this PR.
> extension_dtypes processes these operators differently from the rest of the 
> types, so BooleanExtensionOps is introduced.
>  
>  
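
For context on the dunder protocol involved, a plain-Python sketch of data-type-based boolean operators over pandas' nullable boolean extension dtype; the class below is illustrative only, not the actual BooleanExtensionOps implementation:
{code:python}
import pandas as pd

class BooleanOpsSketch:
    """Illustrative wrapper showing __and__/__or__/__rand__/__ror__ dispatch
    over a nullable boolean Series (pd.NA follows Kleene logic)."""

    def __init__(self, data):
        self._s = pd.Series(data, dtype="boolean")  # extension dtype, keeps pd.NA

    def __and__(self, other):
        return self._s & other

    def __rand__(self, other):
        return other & self._s

    def __or__(self, other):
        return self._s | other

    def __ror__(self, other):
        return other | self._s


ops = BooleanOpsSketch([True, False, None])
print(ops & True)   # True, False, <NA>
print(False | ops)  # dispatch falls back to __ror__: True, False, <NA>
{code}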



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to d

2021-06-07 Thread Julian King (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian King updated SPARK-34591:

Attachment: (was: Reproducible example of Spark bug.pdf)

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is a separate and user-controllable step.
>  
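
For readers without the attachment, a rough PySpark sketch of the kind of check involved (the random data here is only a stand-in; the attached PDF remains the authoritative reproduction):
{code:python}
import random
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Stand-in data set: a binary label and a single numeric feature.
rows = [(float(random.randint(0, 1)), random.random()) for _ in range(1000)]
df = VectorAssembler(inputCols=["x"], outputCol="features") \
    .transform(spark.createDataFrame(rows, ["label", "x"]))

# There is no prune/pruning parameter to pass here: the Scala-side flag is not
# exposed, so whatever pruning happens after fitting cannot be switched off.
model = DecisionTreeClassifier(maxDepth=5).fit(df)
print(model.toDebugString)  # inspect how many splits survived
{code}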



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to d

2021-06-07 Thread Julian King (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian King updated SPARK-34591:

Priority: Major  (was: Minor)

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to d

2021-06-07 Thread Julian King (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian King updated SPARK-34591:

Attachment: Reproducible example of Spark bug.pdf

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to d

2021-06-07 Thread Julian King (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julian King updated SPARK-34591:

Attachment: Reproducible example of Spark bug.pdf

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Julian King (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358953#comment-17358953
 ] 

Julian King commented on SPARK-34591:
-

Here is a reproducible example of this bug which demonstrates a worst-case 
outcome, i.e. the tree has no splits whatsoever. 

[^Reproducible example of Spark bug.pdf]

 

[~srowen], I disagree about this being minor. Based on this, I consider both 
DecisionTreeClassifier and RandomForestClassifier to be functionally broken. 

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the gen

[jira] [Commented] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358955#comment-17358955
 ] 

Sean R. Owen commented on SPARK-34591:
--

I think it's OK to at least expose the parameter in Pyspark, if that seems to 
address it.
Off the top of my head, in this example, there is no real signal as the 
features and label are random. It's perhaps just not able to find any way to 
split the data, given defaults, that results in progress in predicting the 
label?

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally

[jira] [Commented] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Julian King (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358956#comment-17358956
 ] 

Julian King commented on SPARK-34591:
-

The fact that there's no signal isn't the issue. The issue is that Spark is 
undertaking modifications to the tree in a way that is not consistent with 
what the documentation describes.

This is a trivial example with no signal, but I've seen the same behaviour on 
actual data sets that definitely have signal (which I cannot share for 
confidentiality reasons). 

Mechanically, decision trees can always keep splitting until there is only one 
data point per node. Using either the entropy or gini loss functions 
mathematically guarantees that you can find a split with an improved objective 
function in the sum of the children vs the parent. 

[~CBribiescas] has confirmed that when you disable the pruning parameter you 
get the desired behaviour. We will endeavour to submit a pull request soon. 

The original PR for SPARK-3159 should never have been accepted in the first 
place because it modifies the DT algo in a non-trivial way, and this change was 
not described in the Spark documentation.
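
As a concrete instance of that point, a quick Gini check on a toy split (the counts are made up for illustration):
{code:python}
def gini(pos, neg):
    """Gini impurity of a node containing pos positive and neg negative examples."""
    p = pos / (pos + neg)
    return 2 * p * (1 - p)

# Toy parent node with 6 positives and 4 negatives.
parent = gini(6, 4)                                        # 0.48

# A split into children (5 pos, 1 neg) and (1 pos, 3 neg) lowers the
# size-weighted impurity of the children relative to the parent.
children = (6 / 10) * gini(5, 1) + (4 / 10) * gini(1, 3)   # ~0.317

print(parent, children, children < parent)                 # ... True
{code}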

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the pruning described 
> above:
>  * Occurring always (even when the user may not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the Spark detailed 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The pr

[jira] [Created] (SPARK-35673) Spark fails on unrecognized hint in subquery

2021-06-07 Thread Willi Raschkowski (Jira)
Willi Raschkowski created SPARK-35673:
-

 Summary: Spark fails on unrecognized hint in subquery
 Key: SPARK-35673
 URL: https://issues.apache.org/jira/browse/SPARK-35673
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.1.2, 3.1.1, 3.0.2
Reporter: Willi Raschkowski


Spark fails on unrecognized hint in subquery.

To reproduce, try
{code:sql}
-- This succeeds with warning
SELECT /*+ use_hash */ 42;

-- This fails
SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
{code}

The first statement gives you
{code}
21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
42
{code}
while the second statement gives you
{code}
21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
Error in query: unresolved operator 'Project [*];
'Project [*]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [42 AS 42#2]
      +- OneRowRelation
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery

2021-06-07 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-35673:
--
Description: 
Spark fails on unrecognized hint in subquery.

To reproduce:
{code:sql}
SELECT /*+ use_hash */ 42;
-- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- 42

SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
-- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- Error in query: unresolved operator 'Project [*];
-- 'Project [*]
-- +- SubqueryAlias __auto_generated_subquery_name
--+- Project [42 AS 42#2]
--   +- OneRowRelation
{code}


  was:
Spark fails on unrecognized hint in subquery.

To reproduce, try
{code:sql}
-- This succeeds with warning
SELECT /*+ use_hash */ 42;

-- This fails
SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
{code}

The first statement gives you
{code}
21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
42
{code}
while the second statement gives you
{code}
21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
Error in query: unresolved operator 'Project [*];
'Project [*]
+- SubqueryAlias __auto_generated_subquery_name
   +- Project [42 AS 42#2]
      +- OneRowRelation
{code}


> Spark fails on unrecognized hint in subquery
> 
>
> Key: SPARK-35673
> URL: https://issues.apache.org/jira/browse/SPARK-35673
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.1.2
>Reporter: Willi Raschkowski
>Priority: Major
>
> Spark fails on unrecognized hint in subquery.
> To reproduce:
> {code:sql}
> SELECT /*+ use_hash */ 42;
> -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- 42
> SELECT *
> FROM (
> SELECT /*+ use_hash */ 42
> );
> -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- Error in query: unresolved operator 'Project [*];
> -- 'Project [*]
> -- +- SubqueryAlias __auto_generated_subquery_name
> --    +- Project [42 AS 42#2]
> --       +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery

2021-06-07 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-35673:
--
Description: 
Spark queries fail on unrecognized hints in subqueries. An example to 
reproduce:
{code:sql}
SELECT /*+ use_hash */ 42;
-- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- 42

SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
-- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- Error in query: unresolved operator 'Project [*];
-- 'Project [*]
-- +- SubqueryAlias __auto_generated_subquery_name
--    +- Project [42 AS 42#2]
--       +- OneRowRelation
{code}

  was:
Spark queries seem to fail on unrecognized hints in subqueries. An example to 
reproduce:
{code:sql}
SELECT /*+ use_hash */ 42;
-- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- 42

SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
-- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- Error in query: unresolved operator 'Project [*];
-- 'Project [*]
-- +- SubqueryAlias __auto_generated_subquery_name
--    +- Project [42 AS 42#2]
--       +- OneRowRelation
{code}


> Spark fails on unrecognized hint in subquery
> 
>
> Key: SPARK-35673
> URL: https://issues.apache.org/jira/browse/SPARK-35673
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.1.2
>Reporter: Willi Raschkowski
>Priority: Major
>
> Spark queries fail on unrecognized hints in subqueries. An example to 
> reproduce:
> {code:sql}
> SELECT /*+ use_hash */ 42;
> -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- 42
> SELECT *
> FROM (
> SELECT /*+ use_hash */ 42
> );
> -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- Error in query: unresolved operator 'Project [*];
> -- 'Project [*]
> -- +- SubqueryAlias __auto_generated_subquery_name
> --    +- Project [42 AS 42#2]
> --       +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery

2021-06-07 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-35673:
--
Description: 
Spark queries seem to fail on unrecognized hints in subqueries. An example to 
reproduce:
{code:sql}
SELECT /*+ use_hash */ 42;
-- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- 42

SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
-- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- Error in query: unresolved operator 'Project [*];
-- 'Project [*]
-- +- SubqueryAlias __auto_generated_subquery_name
--    +- Project [42 AS 42#2]
--       +- OneRowRelation
{code}

  was:
Spark fails on unrecognized hint in subquery.

To reproduce:
{code:sql}
SELECT /*+ use_hash */ 42;
-- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- 42

SELECT *
FROM (
SELECT /*+ use_hash */ 42
);
-- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
-- Error in query: unresolved operator 'Project [*];
-- 'Project [*]
-- +- SubqueryAlias __auto_generated_subquery_name
--    +- Project [42 AS 42#2]
--       +- OneRowRelation
{code}



> Spark fails on unrecognized hint in subquery
> 
>
> Key: SPARK-35673
> URL: https://issues.apache.org/jira/browse/SPARK-35673
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.1.2
>Reporter: Willi Raschkowski
>Priority: Major
>
> Spark queries seem to fail on unrecognized hints in subqueries. An example to 
> reproduce:
> {code:sql}
> SELECT /*+ use_hash */ 42;
> -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- 42
> SELECT *
> FROM (
> SELECT /*+ use_hash */ 42
> );
> -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- Error in query: unresolved operator 'Project [*];
> -- 'Project [*]
> -- +- SubqueryAlias __auto_generated_subquery_name
> --    +- Project [42 AS 42#2]
> --       +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35673) Spark fails on unrecognized hint in subquery

2021-06-07 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-35673:
--
Issue Type: Bug  (was: Task)

> Spark fails on unrecognized hint in subquery
> 
>
> Key: SPARK-35673
> URL: https://issues.apache.org/jira/browse/SPARK-35673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.1.2
>Reporter: Willi Raschkowski
>Priority: Major
>
> Spark queries fail on unrecognized hints in subqueries. An example to 
> reproduce:
> {code:sql}
> SELECT /*+ use_hash */ 42;
> -- 21/06/08 01:28:05 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- 42
> SELECT *
> FROM (
> SELECT /*+ use_hash */ 42
> );
> -- 21/06/08 01:28:07 WARN HintErrorLogger: Unrecognized hint: use_hash()
> -- Error in query: unresolved operator 'Project [*];
> -- 'Project [*]
> -- +- SubqueryAlias __auto_generated_subquery_name
> --    +- Project [42 AS 42#2]
> --       +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29626) notEqual() should return true when the one is null, the other is not null

2021-06-07 Thread dc-heros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358969#comment-17358969
 ] 

dc-heros commented on SPARK-29626:
--

I would like to work on this

> notEqual() should return true when the one is null, the other is not null
> -
>
> Key: SPARK-29626
> URL: https://issues.apache.org/jira/browse/SPARK-29626
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: zhouhuazheng
>Priority: Minor
>
> When one value is null and the other is not null, we hope the notEqual() 
> function returns true.
> e.g.:
> scala> df.show()
> +------+-------+
> |   age|   name|
> +------+-------+
> |  null|Michael|
> |    30|   Andy|
> |    19| Justin|
> |    35|   null|
> |    19| Justin|
> |  null|   null|
> |Justin| Justin|
> |    19|     19|
> +------+-------+
> scala> df.filter(col("age").notEqual(col("name"))).show
> +---+------+
> |age|  name|
> +---+------+
> | 30|  Andy|
> | 19|Justin|
> | 19|Justin|
> +---+------+
> scala> df.filter(col("age").equalTo(col("name"))).show
> +------+------+
> |   age|  name|
> +------+------+
> |  null|  null|
> |Justin|Justin|
> |    19|    19|
> +------+------+
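
Note that a null-aware "not equal" can already be written with the null-safe equality operator; a small PySpark sketch mirroring the Scala example above (data made up to match it):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame(
    [(None, "Michael"), ("30", "Andy"), ("19", "Justin"), ("35", None),
     ("19", "Justin"), (None, None), ("Justin", "Justin"), ("19", "19")],
    ["age", "name"])

# Plain inequality follows SQL three-valued logic: NULL <> x is NULL, so those rows drop out.
df.filter(col("age") != col("name")).show()

# A null-safe "not equal" also keeps rows where exactly one side is NULL.
df.filter(~col("age").eqNullSafe(col("name"))).show()
{code}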



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuanxm updated SPARK-35667:
---
Attachment: image-2021-06-08-10-02-34-979.png

> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
> Attachments: image-2021-06-08-10-02-34-979.png
>
>
> The following SQL sometimes returns incorrect results when spark.speculation 
> is true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the count result is less than the correct one. 
> An incorrect result is more likely when there are more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35667) spark.speculation causes incorrect query results with TRANSFORM

2021-06-07 Thread yuanxm (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358998#comment-17358998
 ] 

yuanxm commented on SPARK-35667:


FYI, here is a strange thing: the input size/records of the SUCCESS task is 
less than that of the KILLED task. [~xkrogen] [~vsowrirajan] [~ron8hu]

!image-2021-06-08-10-02-34-979.png!

> spark.speculation causes incorrect query results with TRANSFORM
> ---
>
> Key: SPARK-35667
> URL: https://issues.apache.org/jira/browse/SPARK-35667
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: yuanxm
>Priority: Major
> Attachments: image-2021-06-08-10-02-34-979.png
>
>
> The following SQL sometimes returns incorrect results when spark.speculation 
> is true: 
> {code:java}
> SELECT count(1)
> FROM
>   (SELECT TRANSFORM(tmpa1.*) USING "python test.py" AS (dt)
>FROM
>  (SELECT dt
>   FROM test_table)tmpa1)tmpa2{code}
> With spark.speculation=true, the count result is less than the correct one. 
> An incorrect result is more likely when there are more speculative 
> tasks. 
> `test.py`:
> {code:java}
> import sys
> for line in sys.stdin:
> line = line.strip()
> arr = line.split()
> print "\t".join(arr){code}
>  
> spark-sql command:
> {code:java}
> ./bin/spark-sql --master yarn \ 
> --conf spark.speculation=true \ 
> --conf spark.shuffle.service.enabled=true \ 
> --conf spark.dynamicAllocation.enabled=true \ 
> --conf spark.dynamicAllocation.executorIdleTimeout=5s \ 
> --conf spark.dynamicAllocation.initialExecutor=1 \ 
> --conf spark.dynamicAllocation.maxExecutors=40
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35638) Introduce InternalField to manage dtypes and StructFields.

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35638.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32775
[https://github.com/apache/spark/pull/32775]

> Introduce InternalField to manage dtypes and StructFields.
> --
>
> Key: SPARK-35638
> URL: https://issues.apache.org/jira/browse/SPARK-35638
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently there are some performance issues in the pandas-on-Spark layer.
> One of them is accessing the Java DataFrame and running the analysis phase too 
> many times, especially just to retrieve the current column names or data types.
> We should reduce the amount of unnecessary access.
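
A minimal sketch of the caching idea (names and fields below are illustrative only, not the actual InternalField from the pull request):
{code:python}
from dataclasses import dataclass
from typing import Optional

import numpy as np
from pyspark.sql.types import LongType, StructField

@dataclass(frozen=True)
class CachedField:
    """Illustrative holder pairing a pandas dtype with its Spark StructField, so
    lookups of column name / data type need no extra JVM analysis round trip."""
    dtype: np.dtype
    struct_field: Optional[StructField] = None

    @property
    def name(self) -> Optional[str]:
        return self.struct_field.name if self.struct_field is not None else None

field = CachedField(dtype=np.dtype("int64"), struct_field=StructField("id", LongType()))
print(field.name, field.dtype)  # answered locally, without touching the Java DataFrame
{code}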



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35638) Introduce InternalField to manage dtypes and StructFields.

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35638:


Assignee: Takuya Ueshin

> Introduce InternalField to manage dtypes and StructFields.
> --
>
> Key: SPARK-35638
> URL: https://issues.apache.org/jira/browse/SPARK-35638
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> Currently there are some performance issues in the pandas-on-Spark layer.
> One of them is accessing the Java DataFrame and running the analysis phase too 
> many times, especially just to retrieve the current column names or data types.
> We should reduce the amount of unnecessary access.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35603) Add data source options link for R API documentation.

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35603.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32797
[https://github.com/apache/spark/pull/32797]

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35603) Add data source options link for R API documentation.

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35603:


Assignee: Haejoon Lee

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35668:


Assignee: Yikun Jiang

> Use "concurrency" setting on Github Actions
> ---
>
> Key: SPARK-35668
> URL: https://issues.apache.org/jira/browse/SPARK-35668
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
> We are using 
> [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
>  job to cancel previous jobs when a new job is queued. This is now supported 
> natively by GitHub Actions via the 
> ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
>  syntax, which ensures that only a single job or workflow using the same 
> concurrency group runs at a time.
> related: https://github.com/apache/arrow/pull/10416 and 
> https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35668) Use "concurrency" setting on Github Actions

2021-06-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35668.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32806
[https://github.com/apache/spark/pull/32806]

> Use "concurrency" setting on Github Actions
> ---
>
> Key: SPARK-35668
> URL: https://issues.apache.org/jira/browse/SPARK-35668
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.2.0
>
>
> We are using 
> [cancel_duplicate_workflow_runs](https://github.com/apache/spark/blob/a70e66ecfa638cacc99b4e9a7c464e41ec92ad30/.github/workflows/cancel_duplicate_workflow_runs.yml#L1)
>  job to cancel previous jobs when a new job is queued. This is now supported 
> natively by GitHub Actions via the 
> ["concurrency"](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency)
>  syntax, which ensures that only a single job or workflow using the same 
> concurrency group runs at a time.
> related: https://github.com/apache/arrow/pull/10416 and 
> https://github.com/potiuk/cancel-workflow-runs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35636) Do not push down extract value in higher order function that references both sides of a join

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359025#comment-17359025
 ] 

Apache Spark commented on SPARK-35636:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32812

> Do not push down extract value in higher order function that references both 
> sides of a join
> 
>
> Key: SPARK-35636
> URL: https://issues.apache.org/jira/browse/SPARK-35636
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karen Feng
>Assignee: Karen Feng
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, lambda keys can be referenced outside of the lambda function:
> {quote}Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, 
> false)) AS a#0]
> +- 'Join Cross
> :- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0]
> :  +- LocalRelation , [kvs#0]
> +- LocalRelation , [keys#0]{quote}
> This should be unchanged from the original state:
> {quote}Project [transform(keys#418, lambdafunction(kvs#417[lambda 
> key#420].v1, lambda key#420, false)) AS a#419]
> +- Join Cross
> :- LocalRelation , [kvs#417]
> +- LocalRelation , [keys#418]{quote}
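A hedged PySpark sketch of the query shape involved, using illustrative column names (kvs, keys, v1) that mirror the plans above; it is not asserted to reproduce the bug, only to show where the lambda variable is bound relative to the join.

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One side supplies a map column, the other the array the lambda iterates over.
left = spark.createDataFrame([({"a": Row(v1=1)},)], "kvs map<string, struct<v1:int>>")
right = spark.createDataFrame([(["a"],)], "keys array<string>")

# `key` only exists inside transform(); the extraction kvs[key].v1 therefore
# has to stay above the join rather than being pushed into the left side.
result = left.crossJoin(right).select(
    F.transform("keys", lambda key: F.col("kvs")[key]["v1"]).alias("a")
)
result.explain(extended=True)  # inspect where the extraction ends up in the plan
{code}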



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359034#comment-17359034
 ] 

Apache Spark commented on SPARK-34591:
--

User 'CBribiescas' has created a pull request for this issue:
https://github.com/apache/spark/pull/32813

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Occurring always (even if the user does not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is treated as a separate and user-controllable step.
>  
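To make Problem 1 concrete, here is a small self-contained Python illustration (not Spark code; the tree layout and label counts are made up) of the merge rule described above: once sibling leaves that predict the same class are collapsed, their distinct per-leaf probabilities (0.90 and 0.55 below) are replaced by a single pooled estimate.

{code:python}
# Toy illustration of "merge sibling leaves with the same predicted class".
def predicted_class(counts):
    """Index of the class with the highest count in a leaf's label histogram."""
    return max(range(len(counts)), key=lambda c: counts[c])


def prune(node):
    """Recursively merge children whose leaves all predict the same class."""
    if "children" not in node:
        return node
    children = [prune(child) for child in node["children"]]
    if all("children" not in c for c in children):
        preds = {predicted_class(c["counts"]) for c in children}
        if len(preds) == 1:
            # Keep only the pooled label counts; per-leaf probabilities are lost.
            merged = [sum(c["counts"][i] for c in children)
                      for i in range(len(children[0]["counts"]))]
            return {"counts": merged}
    return {"children": children}


# Two leaves both predict class 0, but with P(class 0) = 0.90 and 0.55.
tree = {"children": [{"counts": [90, 10]}, {"counts": [55, 45]}]}
print(prune(tree))  # {'counts': [145, 55]} -> one leaf with P(class 0) = 0.725
{code}

This is why probability estimates from a pruned tree can differ materially from the unpruned tree, and why the effect is harder to tolerate on unbalanced data sets.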



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359033#comment-17359033
 ] 

Apache Spark commented on SPARK-34591:
--

User 'CBribiescas' has created a pull request for this issue:
https://github.com/apache/spark/pull/32813

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Occurring always (even if the user does not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is treated as a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34591:


Assignee: Apache Spark

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Assignee: Apache Spark
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Occurring always (even if the user does not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is treated as a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Assigned] (SPARK-34591) Pyspark undertakes pruning of decision trees and random forests outside the control of the user, leading to undesirable and unexpected outcomes that are challenging to

2021-06-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34591:


Assignee: (was: Apache Spark)

> Pyspark undertakes pruning of decision trees and random forests outside the 
> control of the user, leading to undesirable and unexpected outcomes that are 
> challenging to diagnose and impossible to correct
> --
>
> Key: SPARK-34591
> URL: https://issues.apache.org/jira/browse/SPARK-34591
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.4, 3.1.1
>Reporter: Julian King
>Priority: Major
>  Labels: pyspark
> Attachments: Reproducible example of Spark bug.pdf
>
>
> *History of the issue*
> SPARK-3159 implemented a method designed to reduce the computational burden 
> for predictions from decision trees and random forests by pruning the tree 
> after fitting. This is done in such a way that branches where child leaves 
> all produce the same classification prediction are merged.
> This was implemented via a PR: [https://github.com/apache/spark/pull/20632]
> This feature is controllable by a "prune" parameter in the Scala version of 
> the code, which is set to True as the default behaviour. However, this 
> parameter is not exposed in the Pyspark API, resulting in the behaviour above:
>  * Occurring always (even if the user does not want it to occur)
>  * Not being documented in the ML documentation, leading to decision tree 
> behaviour that may conflict with what the user expects to happen
> *Why is this a problem?*
> +Problem 1: Inaccurate probabilities+
> Because the decision to prune is based on the classification prediction from 
> the tree (not the probability prediction from the node), this introduces 
> additional bias compared to the situation where the pruning is not done. The 
> impact here may be severe in some cases
> +Problem 2: Leads to completely unacceptable behaviours in some circumstances 
> and for some hyper-parameters+
> My colleagues and I encountered this bug in a scenario where we could not get 
> a decision tree classifier (or random forest classifier with a single tree) 
> to split a single node, despite this being eminently supported by the data. 
> This renders the decision trees and random forests completely unusable.
> +Problem 3: Outcomes are highly sensitive to the hyper-parameters chosen, and 
> how they interact with the data+
> Small changes in the hyper-parameters should ideally produce small changes in 
> the built trees. However, here we have found that small changes in the 
> hyper-parameters lead to large and unpredictable changes in the resultant 
> trees as a result of this pruning.
> In principle, this high degree of instability means that re-training the same 
> model, with the same hyper-parameter settings, on slightly different data may 
> lead to large variations in the tree structure simply as a result of the 
> pruning
> +Problem 4: The problems above are much worse for unbalanced data sets+
> Probability estimation on unbalanced data sets using trees should be 
> supported, but the pruning method described will make this very difficult
> +Problem 5: This pruning method is a substantial variation from the 
> description of the decision tree algorithm in the MLLib documents and is not 
> described+
> This made it extremely confusing for us in working out why we were seeing 
> certain behaviours - we had to trace back through all of the detailed Spark 
> release notes to identify where the problem might lie.
> *Proposed solutions*
> +Option 1 (much easier):+
> The proposed solution here is:
>  * Set the default pruning behaviour to False rather than True, thereby 
> bringing the default behaviour back into alignment with the documentation 
> whilst avoiding the issues described above
> +Option 2 (more involved):+
> The proposed solution here is:
>  * Leave the default pruning behaviour set to False
>  * Expand the pyspark API to expose the pruning behaviour as a 
> user-controllable option
>  * Document the change to the API
>  * Document the change to the tree building behaviour at appropriate points 
> in the Spark ML and Spark MLLib documentation
> We recommend that the default behaviour be set to False because this approach 
> is not the generally understood approach for building decision trees, where 
> pruning is treated as a separate and user-controllable step.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
