[jira] [Assigned] (SPARK-37028) Add a 'kill' executor link in the Web UI.

2021-10-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37028:


Assignee: Apache Spark

>  Add a 'kill' executor link in the Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Assignee: Apache Spark
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Assigned] (SPARK-37028) Add a 'kill' executor link in the Web UI.

2021-10-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37028:


Assignee: (was: Apache Spark)

>  Add a 'kill' executor link in the Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Commented] (SPARK-37028) Add a 'kill' executor link in the Web UI.

2021-10-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429652#comment-17429652
 ] 

Apache Spark commented on SPARK-37028:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/34302

>  Add a 'kill' executor link in the Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executor link in the Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Description: 
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this problem, 
but sometimes the speculated task may also run in a bad executor.
 We should have a 'kill' link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.

  was:
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this 
problem,but sometimes the speculated task may also run in a bad executor.
 We should have a 'kill' link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.


>  Add a 'kill' executor link in the Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executor link in the Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Description: 
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this 
problem,but sometimes the speculated task may also run in a bad executor.
 We should have a 'kill' link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.

  was:
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this 
problem,but sometimes the speculated task may also run in a bad executor.
We should have a "kill" link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.


>  Add a 'kill' executor link in the Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executor link in the Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Summary:  Add a 'kill' executor link in the Web UI.  (was:  Add a 'kill' 
executor link in Web UI.)

>  Add a 'kill' executor link in the Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executor link in Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Description: 
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this 
problem,but sometimes the speculated task may also run in a bad executor.
We should have a "kill" link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.

  was:
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or it has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this 
problem,but sometimes the speculated task may also run in a bad executor.
We should have a "kill" link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.


>  Add a 'kill' executor link in Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executor link in Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Description: 
The executor which is running in a bad node(eg. The system is overloaded or 
disks are busy) or it has big GC overheads may affect the efficiency of job 
execution, although there are speculative mechanisms to resolve this 
problem,but sometimes the speculated task may also run in a bad executor.
We should have a "kill" link for each executor, similar to what we have for 
each stage, so it's easier for users to kill executors in the UI.

  was:Add a 'kill' executors link in Web UI.


>  Add a 'kill' executor link in Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> An executor running on a bad node (e.g., the system is overloaded or the disks
> are busy) or with high GC overhead can hurt job execution efficiency. Although
> speculative execution can mitigate this, the speculated task may also end up
> running on a bad executor.
> We should have a 'kill' link for each executor, similar to what we have for
> each stage, so it's easier for users to kill executors in the UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executor link in Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Summary:  Add a 'kill' executor link in Web UI.  (was:  Add a 'kill' 
executors link in Web UI.)

>  Add a 'kill' executor link in Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> Add a 'kill' executors link in Web UI.






[jira] [Updated] (SPARK-37028) Add a 'kill' executors link in Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Summary:  Add a 'kill' executors link in Web UI.  (was:  Add  'kill' 
executors link in Web UI.)

>  Add a 'kill' executors link in Web UI.
> ---
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> Add a 'kill' executors link in Web UI.






[jira] [Created] (SPARK-37028) Add 'kill' executors link in Web UI.

2021-10-16 Thread weixiuli (Jira)
weixiuli created SPARK-37028:


 Summary:  Add  'kill' executors link in Web UI.
 Key: SPARK-37028
 URL: https://issues.apache.org/jira/browse/SPARK-37028
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: weixiuli


Add 'kill' executors link in Web UI.






[jira] [Updated] (SPARK-37028) Add 'kill' executors link in Web UI.

2021-10-16 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-37028:
-
Description: Add a 'kill' executors link in Web UI.  (was: Add 'kill' 
executors link in Web UI.)

>  Add  'kill' executors link in Web UI.
> --
>
> Key: SPARK-37028
> URL: https://issues.apache.org/jira/browse/SPARK-37028
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: weixiuli
>Priority: Major
>
> Add a 'kill' executors link in Web UI.






[jira] [Updated] (SPARK-37026) Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13

2021-10-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37026:
---
Component/s: Build

> Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13
> -
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: Build, ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.
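For readers unfamiliar with the Scala 2.13 collection changes, a standalone REPL sketch of the mechanism (not Spark's code; the values are invented) shows why the cast only fails at use time:
{code:scala}
import scala.collection.mutable

// Elements are mutable.ArraySeq$ofRef, not immutable Seq[String].
val raw: Seq[AnyRef] = Seq(mutable.ArraySeq("a", "b"), mutable.ArraySeq("c"))
val terms = raw.asInstanceOf[Seq[Seq[String]]]  // erased cast, no error here
// In 2.13, scala.Seq is immutable.Seq, so using an element as Seq[String] throws:
// java.lang.ClassCastException: class scala.collection.mutable.ArraySeq$ofRef
//   cannot be cast to class scala.collection.immutable.Seq
terms.map(_.mkString("{", ",", "}")).mkString("[", ",", "]")
{code}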






[jira] [Updated] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

2021-10-16 Thread Yuzhou Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhou Sun updated SPARK-37027:
---
Attachment: SPARK-37027-test-example.patch

> Fix behavior inconsistent in Hive table when ‘path’ is provided in 
> SERDEPROPERTIES
> --
>
> Key: SPARK-37027
> URL: https://issues.apache.org/jira/browse/SPARK-37027
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.1.2
>Reporter: Yuzhou Sun
>Priority: Trivial
> Attachments: SPARK-37027-test-example.patch
>
>
> If a Hive table is created with both {{WITH SERDEPROPERTIES
> ('path'='<tableLocation>')}} and {{LOCATION '<tableLocation>'}}, Spark can
> return doubled rows when reading the table. This issue seems to be an
> extension of SPARK-30507.
>  Steps to reproduce:
>  # Create the table and insert records via Hive (Spark doesn't allow
> inserting into a table like this)
> {code:sql}
> CREATE TABLE `test_table`(
>   `c1` LONG,
>   `c2` STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> WITH SERDEPROPERTIES ('path'='<tableLocation>')
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>   OUTPUTFORMAT
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION '<tableLocation>';
> INSERT INTO TABLE `test_table`
> VALUES (0, '0');
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> {code}
>  # Read the above table from Spark
> {code:sql}
> SELECT * FROM `test_table`;
> -- will return
> -- 0 0
> -- 0 0
> {code}
> But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark will
> return the same result as Hive (i.e. a single row).
> A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES
> ('path'='<anotherPath>')}} and {{LOCATION '<tableLocation>'}}, Spark will read
> both the rows under {{anotherPath}} and the rows under {{tableLocation}},
> regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}.
> However, Hive seems to return only the rows under {{tableLocation}}.
> Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark
> won't double the rows when {{'path'='<tableLocation>'}}. If
> {{'path'='<anotherPath>'}}, Spark will read both the rows under
> {{anotherPath}} and the rows under {{tableLocation}}, while Hive seems to keep
> ignoring the {{path}} in {{TBLPROPERTIES}}.
> Code examples for the above cases (a diff patch written against
> {{HiveParquetMetastoreSuite.scala}}) can be found in Attachments.






[jira] [Created] (SPARK-37027) Fix behavior inconsistent in Hive table when ‘path’ is provided in SERDEPROPERTIES

2021-10-16 Thread Yuzhou Sun (Jira)
Yuzhou Sun created SPARK-37027:
--

 Summary: Fix behavior inconsistent in Hive table when ‘path’ is 
provided in SERDEPROPERTIES
 Key: SPARK-37027
 URL: https://issues.apache.org/jira/browse/SPARK-37027
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 2.4.5
Reporter: Yuzhou Sun


If a Hive table is created with both {{WITH SERDEPROPERTIES
('path'='<tableLocation>')}} and {{LOCATION '<tableLocation>'}}, Spark can return
doubled rows when reading the table. This issue seems to be an extension of
SPARK-30507.

 Steps to reproduce:
 # Create the table and insert records via Hive (Spark doesn't allow inserting
into a table like this)
{code:sql}
CREATE TABLE `test_table`(
  `c1` LONG,
  `c2` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('path'='<tableLocation>')
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<tableLocation>';

INSERT INTO TABLE `test_table`
VALUES (0, '0');

SELECT * FROM `test_table`;
-- will return
-- 0 0
{code}

 # Read the above table from Spark
{code:sql}
SELECT * FROM `test_table`;
-- will return
-- 0 0
-- 0 0
{code}

But if we set {{spark.sql.hive.convertMetastoreParquet=false}}, Spark will
return the same result as Hive (i.e. a single row).

A similar case: if a Hive table is created with both {{WITH SERDEPROPERTIES
('path'='<anotherPath>')}} and {{LOCATION '<tableLocation>'}}, Spark will read
both the rows under {{anotherPath}} and the rows under {{tableLocation}},
regardless of the value of {{spark.sql.hive.convertMetastoreParquet}}. However,
Hive seems to return only the rows under {{tableLocation}}.

Another similar case: if {{path}} is provided in {{TBLPROPERTIES}}, Spark won't
double the rows when {{'path'='<tableLocation>'}}. If {{'path'='<anotherPath>'}},
Spark will read both the rows under {{anotherPath}} and the rows under
{{tableLocation}}, while Hive seems to keep ignoring the {{path}} in
{{TBLPROPERTIES}}.

Code examples for the above cases (a diff patch written against
{{HiveParquetMetastoreSuite.scala}}) can be found in Attachments.
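A quick way to observe the difference described above from spark-shell (a hedged sketch assuming the {{test_table}} created in step 1 exists and {{spark}} is the usual SparkSession):
{code:scala}
// Compare row counts with the Parquet metastore conversion on and off;
// the doubled rows only appear on the converted (data source) path.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
val convertedCount = spark.sql("SELECT * FROM test_table").count()

spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
val hiveSerdeCount = spark.sql("SELECT * FROM test_table").count()

println(s"converted path: $convertedCount row(s), Hive SerDe path: $hiveSerdeCount row(s)")
{code}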






[jira] [Commented] (SPARK-37026) Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13

2021-10-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429644#comment-17429644
 ] 

Apache Spark commented on SPARK-37026:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34301

> Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13
> -
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.






[jira] [Assigned] (SPARK-37026) Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13

2021-10-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37026:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13
> -
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.






[jira] [Assigned] (SPARK-37026) Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13

2021-10-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37026:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13
> -
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.






[jira] [Updated] (SPARK-37026) Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13

2021-10-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37026:
---
Summary: Ensure the element type of ResolvedRFormula.terms is scala.Seq for 
Scala 2.13  (was: Ensure the element type of RFormula.terms is scala.Seq for 
Scala 2.13)

> Ensure the element type of ResolvedRFormula.terms is scala.Seq for Scala 2.13
> -
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.






[jira] [Updated] (SPARK-37026) Ensure the element type of RFormula.terms is scala.Seq for Scala 2.13

2021-10-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37026:
---
Summary: Ensure the element type of RFormula.terms is scala.Seq for Scala 
2.13  (was: ResolvedRFormula.toString throws ClassCastException with Scala 2.13)

> Ensure the element type of RFormula.terms is scala.Seq for Scala 2.13
> -
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.






[jira] [Updated] (SPARK-37026) ResolvedRFormula.toString throws ClassCastException with Scala 2.13

2021-10-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37026:
---
Description: ResolvedRFormula.toString throws ClassCastException with Scala 
2.13 because the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] 
but scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.  (was: 
ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because the 
type of ResolvedRFormula.terms is 
scala.collection.immutable.Seq[scala.collection.imutable.Seq[String]] but 
scala.collection.immutable.Seq[scala.collection.mutable.ArraySeq$ofRef] will be 
passed.)

> ResolvedRFormula.toString throws ClassCastException with Scala 2.13
> ---
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is scala.Seq[scala.Seq[String]] but 
> scala.Seq[scala.collection.mutable.ArraySeq$ofRef] will be passed.






[jira] [Created] (SPARK-37026) ResolvedRFormula.toString throws ClassCastException with Scala 2.13

2021-10-16 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37026:
--

 Summary: ResolvedRFormula.toString throws ClassCastException with 
Scala 2.13
 Key: SPARK-37026
 URL: https://issues.apache.org/jira/browse/SPARK-37026
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because the
type of ResolvedRFormula.terms is
scala.collection.immutable.Seq[scala.collection.immutable.Seq[String]] but
scala.collection.immutable.Seq[scala.collection.mutable.ArraySeq$ofRef] will be
passed.






[jira] [Updated] (SPARK-37026) ResolvedRFormula.toString throws ClassCastException with Scala 2.13

2021-10-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37026:
---
Issue Type: Bug  (was: Improvement)

> ResolvedRFormula.toString throws ClassCastException with Scala 2.13
> ---
>
> Key: SPARK-37026
> URL: https://issues.apache.org/jira/browse/SPARK-37026
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ResolvedRFormula.toString throws ClassCastException with Scala 2.13 because 
> the type of ResolvedRFormula.terms is 
> scala.collection.immutable.Seq[scala.collection.immutable.Seq[String]] but
> scala.collection.immutable.Seq[scala.collection.mutable.ArraySeq$ofRef] will
> be passed.






[jira] [Updated] (SPARK-36966) Spark evicts RDD partitions instead of allowing OOM

2021-10-16 Thread sam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-36966:

Affects Version/s: 1.6.0

> Spark evicts RDD partitions instead of allowing OOM
> ---
>
> Key: SPARK-36966
> URL: https://issues.apache.org/jira/browse/SPARK-36966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> In the past Spark (pre 1.6) jobs would give OOM if an RDD could not fit into 
> memory (when trying to cache with MEMORY_ONLY). These days it seems Spark 
> jobs will evict partitions from the cache and recompute them from scratch.
> We have some jobs that cache an RDD then traverse it 300 times.  In order to 
> know when we need to increase the memory on our cluster, we need to know when 
> it has run out of memory.
> The new behaviour of Spark makes this difficult ... rather than the job 
> OOMing (like in the past), the job instead just takes forever (our 
> surrounding logic eventually times out the job).  Diagnosing why the job 
> failed becomes difficult because it's not immediately obvious from the logs 
> that the job has run out of memory (since no OOM is thrown).  One can find 
> "evicted" log lines.
> As a bit of a hack, we are using accumulators with mapPartitionsWithIndex and 
> updating a count of the number of times each partition is traversed, then we 
> forcibly blow up the job when this count is 2 or more (indicating an 
> eviction).  This hack doesn't work very well as it seems to give false 
> positives (RCA not yet understood, it doesn't seem to be speculative 
> execution, nor are tasks failing (`egrep "task .* in stage [0-9]+\.1" | wc 
> -l` gives 0)).
> *Question 1*: Is there a way to disable this new behaviour of Spark and make 
> it behave like it used to (i.e. just blow up with OOM) - I've looked in the 
> Spark configuration and cannot find anything like "disable-eviction".
> *Question 2*: Is there any method on the SparkSession or SparkContext that we 
> can call to easily detect when eviction is happening?
> If not, then for our use case this is an effective regression - we need a way 
> to make Spark behave predictably, or at least a way to determine 
> automatically when Spark is running slowly due to lack of memory.
> FOR CONTEXT
> > Borrowed storage memory may be evicted when memory pressure arises
> From 
> https://issues.apache.org/jira/secure/attachment/12765646/unified-memory-management-spark-1.pdf,
>  which is attached to https://issues.apache.org/jira/browse/SPARK-1 which 
> was implemented in 1.6 
> https://spark.apache.org/releases/spark-release-1-6-0.html
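The accumulator-based detection hack described above might look roughly like this spark-shell sketch (names such as {{computeCounts}} are invented; as the report notes, task retries or speculative execution can also inflate the count and cause false positives):
{code:scala}
import org.apache.spark.storage.StorageLevel

// Count partition computations: a fully cached RDD computes each partition
// exactly once, so any count beyond getNumPartitions suggests eviction.
val computeCounts = sc.longAccumulator("partition-compute-count")

val cached = sc.parallelize(1 to 1000000, 8)
  .mapPartitionsWithIndex { (_, iter) =>
    computeCounts.add(1)  // runs only when the partition is (re)computed
    iter
  }
  .persist(StorageLevel.MEMORY_ONLY)

val numPartitions = cached.getNumPartitions
(1 to 300).foreach { pass =>
  cached.count()
  if (computeCounts.value > numPartitions) {
    sys.error(s"Possible cache eviction on pass $pass: " +
      s"${computeCounts.value} partition computations for $numPartitions partitions")
  }
}
{code}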






[jira] [Updated] (SPARK-36966) Spark evicts RDD partitions instead of allowing OOM

2021-10-16 Thread sam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-36966:

Description: 
In the past Spark (pre 1.6) jobs would give OOM if an RDD could not fit into 
memory (when trying to cache with MEMORY_ONLY). These days it seems Spark jobs 
will evict partitions from the cache and recompute them from scratch.

We have some jobs that cache an RDD then traverse it 300 times.  In order to 
know when we need to increase the memory on our cluster, we need to know when 
it has run out of memory.

The new behaviour of Spark makes this difficult ... rather than the job OOMing 
(like in the past), the job instead just takes forever (our surrounding logic 
eventually times out the job).  Diagnosing why the job failed becomes difficult 
because it's not immediately obvious from the logs that the job has run out of 
memory (since no OOM is thrown).  One can find "evicted" log lines.

As a bit of a hack, we are using accumulators with mapPartitionsWithIndex and 
updating a count of the number of times each partition is traversed, then we 
forcibly blow up the job when this count is 2 or more (indicating an eviction). 
 This hack doesn't work very well as it seems to give false positives (RCA not 
yet understood, it doesn't seem to be speculative execution, nor are tasks 
failing (`egrep "task .* in stage [0-9]+\.1" | wc -l` gives 0).

*Question 1*: Is there a way to disable this new behaviour of Spark and make it 
behave like it used to (i.e. just blow up with OOM) - I've looked in the Spark 
configuration and cannot find anything like "disable-eviction".

*Question 2*: Is there any method on the SparkSession or SparkContext that we 
can call to easily detect when eviction is happening?

If not, then for our use case this is an effective regression - we need a way 
to make Spark behave predictably, or at least a way to determine automatically 
when Spark is running slowly due to lack of memory.


FOR CONTEXT

> Borrowed storage memory may be evicted when memory pressure arises

From
https://issues.apache.org/jira/secure/attachment/12765646/unified-memory-management-spark-1.pdf,
which is attached to https://issues.apache.org/jira/browse/SPARK-1 which
was implemented in 1.6
https://spark.apache.org/releases/spark-release-1-6-0.html


  was:
In the past Spark jobs would give OOM if an RDD could not fit into memory (when 
trying to cache with MEMORY_ONLY). These days it seems Spark jobs will evict 
partitions from the cache and recompute them from scratch.

We have some jobs that cache an RDD then traverse it 300 times.  In order to 
know when we need to increase the memory on our cluster, we need to know when 
it has run out of memory.

The new behaviour of Spark makes this difficult ... rather than the job OOMing 
(like in the past), the job instead just takes forever (our surrounding logic 
eventually times out the job).  Diagnosing why the job failed becomes difficult 
because it's not immediately obvious from the logs that the job has run out of 
memory (since no OOM is thrown).  One can find "evicted" log lines.

As a bit of a hack, we are using accumulators with mapPartitionsWithIndex and 
updating a count of the number of times each partition is traversed, then we 
forcably blow up the job when this count is 2 or more (indicating an 
evicition).  This hack doesn't work very well as it seems to give false 
positives (RCA not yet understood, it doesn't seem to be speculative execution, 
nor are tasks failing (`egrep "task .* in stage [0-9]+\.1" | wc -l` gives 0).

*Question 1*: Is there a way to disable this new behaviour of Spark and make it 
behave like it used to (i.e. just blow up with OOM) - I've looked in the Spark 
configuration and cannot find anything like "disable-eviction".

*Question 2*: Is there any method on the SparkSession or SparkContext that we 
can call to easily detect when eviction is happening?

If not, then for our use case this is an effective regression - we need a way 
to make Spark behave predictably, or at least a way to determine automatically 
when Spark is running slowly due to lack of memory.


FOR CONTEXT

> Borrowed storage memory may be evicted when memory pressure arises

From
https://issues.apache.org/jira/secure/attachment/12765646/unified-memory-management-spark-1.pdf,
which is attached to https://issues.apache.org/jira/browse/SPARK-1 which
was implemented in 1.6
https://spark.apache.org/releases/spark-release-1-6-0.html



> Spark evicts RDD partitions instead of allowing OOM
> ---
>
> Key: SPARK-36966
> URL: https://issues.apache.org/jira/browse/SPARK-36966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: sam
>Priority: Major
>
> In the past Spark (

[jira] [Updated] (SPARK-36966) Spark evicts RDD partitions instead of allowing OOM

2021-10-16 Thread sam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-36966:

Description: 
In the past Spark jobs would give OOM if an RDD could not fit into memory (when 
trying to cache with MEMORY_ONLY). These days it seems Spark jobs will evict 
partitions from the cache and recompute them from scratch.

We have some jobs that cache an RDD then traverse it 300 times.  In order to 
know when we need to increase the memory on our cluster, we need to know when 
it has run out of memory.

The new behaviour of Spark makes this difficult ... rather than the job OOMing 
(like in the past), the job instead just takes forever (our surrounding logic 
eventually times out the job).  Diagnosing why the job failed becomes difficult 
because it's not immediately obvious from the logs that the job has run out of 
memory (since no OOM is thrown).  One can find "evicted" log lines.

As a bit of a hack, we are using accumulators with mapPartitionsWithIndex and 
updating a count of the number of times each partition is traversed, then we 
forcably blow up the job when this count is 2 or more (indicating an 
evicition).  This hack doesn't work very well as it seems to give false 
positives (RCA not yet understood, it doesn't seem to be speculative execution, 
nor are tasks failing (`egrep "task .* in stage [0-9]+\.1" | wc -l` gives 0).

*Question 1*: Is there a way to disable this new behaviour of Spark and make it 
behave like it used to (i.e. just blow up with OOM) - I've looked in the Spark 
configuration and cannot find anything like "disable-eviction".

*Question 2*: Is there any method on the SparkSession or SparkContext that we 
can call to easily detect when eviction is happening?

If not, then for our use case this is an effective regression - we need a way 
to make Spark behave predictably, or at least a way to determine automatically 
when Spark is running slowly due to lack of memory.


FOR CONTEXT

> Borrowed storage memory may be evicted when memory pressure arises

From
https://issues.apache.org/jira/secure/attachment/12765646/unified-memory-management-spark-1.pdf,
which is attached to https://issues.apache.org/jira/browse/SPARK-1 which
was implemented in 1.6
https://spark.apache.org/releases/spark-release-1-6-0.html


  was:
In the past Spark jobs would give OOM if an RDD could not fit into memory (when 
trying to cache with MEMORY_ONLY). These days it seems Spark jobs will evict 
partitions from the cache and recompute them from scratch.

We have some jobs that cache an RDD then traverse it 300 times.  In order to 
know when we need to increase the memory on our cluster, we need to know when 
it has run out of memory.

The new behaviour of Spark makes this difficult ... rather than the job OOMing 
(like in the past), the job instead just takes forever (our surrounding logic 
eventually times out the job).  Diagnosing why the job failed becomes difficult 
because it's not immediately obvious from the logs that the job has run out of 
memory (since no OOM is thrown).  One can find "evicted" log lines.

As a bit of a hack, we are using accumulators with mapPartitionsWithIndex and 
updating a count of the number of times each partition is traversed, then we 
forcably blow up the job when this count is 2 or more (indicating an 
evicition).  This hack doesn't work very well as it seems to give false 
positives (RCA not yet understood, it doesn't seem to be speculative execution, 
nor are tasks failing (`egrep "task .* in stage [0-9]+\.1" | wc -l` gives 0).

*Question 1*: Is there a way to disable this new behaviour of Spark and make it 
behave like it used to (i.e. just blow up with OOM) - I've looked in the Spark 
configuration and cannot find anything like "disable-eviction".

*Question 2*: Is there any method on the SparkSession or SparkContext that we 
can call to easily detect when eviction is happening?

If not, then for our use case this is an effective regression - we need a way 
to make Spark behave predictably, or at least a way to determine automatically 
when Spark is running slowly due to lack of memory.




> Spark evicts RDD partitions instead of allowing OOM
> ---
>
> Key: SPARK-36966
> URL: https://issues.apache.org/jira/browse/SPARK-36966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: sam
>Priority: Major
>
> In the past Spark jobs would give OOM if an RDD could not fit into memory 
> (when trying to cache with MEMORY_ONLY). These days it seems Spark jobs will 
> evict partitions from the cache and recompute them from scratch.
> We have some jobs that cache an RDD then traverse it 300 times.  In order to 
> know when we need to increase the memory on our cluster, we need to know when 
> i

[jira] [Assigned] (SPARK-37025) Upgrade RoaringBitmap to 0.9.22

2021-10-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37025:


Assignee: Apache Spark  (was: Lantao Jin)

> Upgrade RoaringBitmap to 0.9.22
> ---
>
> Key: SPARK-37025
> URL: https://issues.apache.org/jira/browse/SPARK-37025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Igor Dvorzhak
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-37025) Upgrade RoaringBitmap to 0.9.22

2021-10-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37025:


Assignee: Lantao Jin  (was: Apache Spark)

> Upgrade RoaringBitmap to 0.9.22
> ---
>
> Key: SPARK-37025
> URL: https://issues.apache.org/jira/browse/SPARK-37025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Igor Dvorzhak
>Assignee: Lantao Jin
>Priority: Major
>







[jira] [Commented] (SPARK-37025) Upgrade RoaringBitmap to 0.9.22

2021-10-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429583#comment-17429583
 ] 

Apache Spark commented on SPARK-37025:
--

User 'medb' has created a pull request for this issue:
https://github.com/apache/spark/pull/34300

> Upgrade RoaringBitmap to 0.9.22
> ---
>
> Key: SPARK-37025
> URL: https://issues.apache.org/jira/browse/SPARK-37025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Igor Dvorzhak
>Assignee: Lantao Jin
>Priority: Major
>







[jira] [Updated] (SPARK-37025) Upgrade RoaringBitmap to 0.9.22

2021-10-16 Thread Igor Dvorzhak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Dvorzhak updated SPARK-37025:
--
Labels:   (was: correctness)

> Upgrade RoaringBitmap to 0.9.22
> ---
>
> Key: SPARK-37025
> URL: https://issues.apache.org/jira/browse/SPARK-37025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Igor Dvorzhak
>Assignee: Lantao Jin
>Priority: Major
>







[jira] [Updated] (SPARK-37025) Upgrade RoaringBitmap to 0.9.22

2021-10-16 Thread Igor Dvorzhak (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Dvorzhak updated SPARK-37025:
--
Fix Version/s: (was: 2.4.2)
   (was: 2.3.4)
   (was: 3.0.0)
Affects Version/s: (was: 2.3.3)
   (was: 2.4.0)
   (was: 3.0.0)
   (was: 2.2.0)
   (was: 2.1.0)
   (was: 2.0.0)
   3.2.0
  Description: (was: HighlyCompressedMapStatus uses RoaringBitmap 
to record the empty blocks. But RoaringBitmap-0.5.11 couldn't be ser/deser with 
unsafe KryoSerializer.

We can use below UT to reproduce:
{code}
  test("kryo serialization with RoaringBitmap") {
    val bitmap = new RoaringBitmap
    bitmap.add(1787)

    val safeSer = new KryoSerializer(conf).newInstance()
    val bitmap2: RoaringBitmap = safeSer.deserialize(safeSer.serialize(bitmap))
    assert(bitmap2.equals(bitmap))

    conf.set("spark.kryo.unsafe", "true")
    val unsafeSer = new KryoSerializer(conf).newInstance()
    val bitmap3: RoaringBitmap = unsafeSer.deserialize(unsafeSer.serialize(bitmap))
    assert(bitmap3.equals(bitmap)) // this will fail
  }
{code}
Upgrade to latest version 0.7.45 to fix it)

> Upgrade RoaringBitmap to 0.9.22
> ---
>
> Key: SPARK-37025
> URL: https://issues.apache.org/jira/browse/SPARK-37025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Igor Dvorzhak
>Assignee: Lantao Jin
>Priority: Major
>  Labels: correctness
>







[jira] [Created] (SPARK-37025) Upgrade RoaringBitmap to 0.9.22

2021-10-16 Thread Igor Dvorzhak (Jira)
Igor Dvorzhak created SPARK-37025:
-

 Summary: Upgrade RoaringBitmap to 0.9.22
 Key: SPARK-37025
 URL: https://issues.apache.org/jira/browse/SPARK-37025
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.3, 2.4.0, 3.0.0
Reporter: Igor Dvorzhak
Assignee: Lantao Jin
 Fix For: 2.3.4, 2.4.2, 3.0.0


HighlyCompressedMapStatus uses RoaringBitmap to record the empty blocks. But 
RoaringBitmap-0.5.11 couldn't be ser/deser with unsafe KryoSerializer.

We can use the unit test below to reproduce:
{code}
  test("kryo serialization with RoaringBitmap") {
    val bitmap = new RoaringBitmap
    bitmap.add(1787)

    val safeSer = new KryoSerializer(conf).newInstance()
    val bitmap2: RoaringBitmap = safeSer.deserialize(safeSer.serialize(bitmap))
    assert(bitmap2.equals(bitmap))

    conf.set("spark.kryo.unsafe", "true")
    val unsafeSer = new KryoSerializer(conf).newInstance()
    val bitmap3: RoaringBitmap = unsafeSer.deserialize(unsafeSer.serialize(bitmap))
    assert(bitmap3.equals(bitmap)) // this will fail
  }
{code}
Upgrade to the latest version 0.7.45 to fix it.






[jira] [Commented] (SPARK-27312) PropertyGraph <-> GraphX conversions

2021-10-16 Thread Munish Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429577#comment-17429577
 ] 

Munish Sharma commented on SPARK-27312:
---

[~mengxr] can I start working on this? If so, please assign it to me.

> PropertyGraph <-> GraphX conversions
> 
>
> Key: SPARK-27312
> URL: https://issues.apache.org/jira/browse/SPARK-27312
> Project: Spark
>  Issue Type: Story
>  Components: Graph, GraphX
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> As a user, I can convert a GraphX graph into a PropertyGraph and a 
> PropertyGraph into a GraphX graph if they are compatible.
> * Scala only
> * Whether this is an internal API is pending design discussion.






[jira] [Resolved] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36992.
--
Fix Version/s: 3.3.0
 Assignee: XiDuo You
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/34267

> Improve byte array sort perf by unify getPrefix function of UTF8String and 
> ByteArray
> 
>
> Key: SPARK-36992
> URL: https://issues.apache.org/jira/browse/SPARK-36992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.3.0
>
>
> When executing the sort operator, we first compare the prefix. However, the
> getPrefix function for byte arrays is slow. We use the first 8 bytes as the
> prefix, so we may call `Platform.getByte` up to 8 times, which is slower than
> calling `Platform.getInt` or `Platform.getLong` once.
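To make the cost difference concrete, here is an illustrative REPL-style sketch (not Spark's actual UTF8String/ByteArray code) of building a left-aligned 8-byte sort prefix one byte at a time versus with a single 8-byte read:
{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// Byte-at-a-time prefix: up to eight Platform.getByte-style reads.
def prefixByteWise(bytes: Array[Byte]): Long = {
  val n = math.min(8, bytes.length)
  var prefix = 0L
  var i = 0
  while (i < n) {
    prefix = (prefix << 8) | (bytes(i) & 0xffL)
    i += 1
  }
  prefix << (8 * (8 - n))  // left-align short arrays within the 64-bit prefix
}

// Single-read prefix: one 8-byte load when the array is long enough.
def prefixLongRead(bytes: Array[Byte]): Long =
  if (bytes.length >= 8) ByteBuffer.wrap(bytes, 0, 8).order(ByteOrder.BIG_ENDIAN).getLong
  else prefixByteWise(bytes)
{code}
In practice the prefixes are compared with an unsigned comparator; the sketch only illustrates why one wide read beats several narrow ones.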






[jira] [Updated] (SPARK-36992) Improve byte array sort perf by unify getPrefix function of UTF8String and ByteArray

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-36992:
-
Priority: Minor  (was: Major)

> Improve byte array sort perf by unify getPrefix function of UTF8String and 
> ByteArray
> 
>
> Key: SPARK-36992
> URL: https://issues.apache.org/jira/browse/SPARK-36992
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Minor
> Fix For: 3.3.0
>
>
> When executing the sort operator, we first compare the prefix. However, the
> getPrefix function for byte arrays is slow. We use the first 8 bytes as the
> prefix, so we may call `Platform.getByte` up to 8 times, which is slower than
> calling `Platform.getInt` or `Platform.getLong` once.






[jira] [Assigned] (SPARK-37008) WholeStageCodegenSparkSubmitSuite Failed with Java 17

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-37008:


Assignee: Yang Jie

> WholeStageCodegenSparkSubmitSuite Failed with Java 17 
> --
>
> Key: SPARK-37008
> URL: https://issues.apache.org/jira/browse/SPARK-37008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> The WholeStageCodegenSparkSubmitSuite test fails when using Java 17
> {code:java}
>  2021-10-14 04:32:38.038 - stderr> Exception in thread "main" 
> org.scalatest.exceptions.TestFailedException: 16 was not greater than 16
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.sql.execution.WholeStageCodegenSparkSubmitSuite$.main(WholeStageCodegenSparkSubmitSuite.scala:82)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.sql.execution.WholeStageCodegenSparkSubmitSuite.main(WholeStageCodegenSparkSubmitSuite.scala)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37008) WholeStageCodegenSparkSubmitSuite Failed with Java 17

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-37008:
-
Priority: Minor  (was: Major)

> WholeStageCodegenSparkSubmitSuite Failed with Java 17 
> --
>
> Key: SPARK-37008
> URL: https://issues.apache.org/jira/browse/SPARK-37008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> The WholeStageCodegenSparkSubmitSuite test fails when run with Java 17:
> {code:java}
>  2021-10-14 04:32:38.038 - stderr> Exception in thread "main" 
> org.scalatest.exceptions.TestFailedException: 16 was not greater than 16
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.sql.execution.WholeStageCodegenSparkSubmitSuite$.main(WholeStageCodegenSparkSubmitSuite.scala:82)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.sql.execution.WholeStageCodegenSparkSubmitSuite.main(WholeStageCodegenSparkSubmitSuite.scala)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37008) WholeStageCodegenSparkSubmitSuite Failed with Java 17

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37008.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34286
[https://github.com/apache/spark/pull/34286]

> WholeStageCodegenSparkSubmitSuite Failed with Java 17 
> --
>
> Key: SPARK-37008
> URL: https://issues.apache.org/jira/browse/SPARK-37008
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.3.0
>
>
> The WholeStageCodegenSparkSubmitSuite test fails when run with Java 17:
> {code:java}
>  2021-10-14 04:32:38.038 - stderr> Exception in thread "main" 
> org.scalatest.exceptions.TestFailedException: 16 was not greater than 16
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.sql.execution.WholeStageCodegenSparkSubmitSuite$.main(WholeStageCodegenSparkSubmitSuite.scala:82)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.sql.execution.WholeStageCodegenSparkSubmitSuite.main(WholeStageCodegenSparkSubmitSuite.scala)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   2021-10-14 04:32:38.038 - stderr>   at 
> java.base/java.lang.reflect.Method.invoke(Method.java:568)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>   2021-10-14 04:32:38.038 - stderr>   at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-36900:


Assignee: (was: Yang Jie)

> "SPARK-36464: size returns correct positive number even with over 2GB data" 
> will oom with JDK17 
> 
>
> Key: SPARK-36900
> URL: https://issues.apache.org/jira/browse/SPARK-36900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> Execute
>  
> {code:java}
> build/mvn clean install  -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite
> {code}
> with JDK 17, and the run aborts with an OutOfMemoryError:
> {code:java}
> ChunkedByteBufferOutputStreamSuite:
> - empty output
> - write a single byte
> - write a single near boundary
> - write a single at boundary
> - single chunk output
> - single chunk output at boundary size
> - multiple chunk output
> - multiple chunk output at boundary size
> *** RUN ABORTED ***
>   java.lang.OutOfMemoryError: Java heap space
>   at java.base/java.lang.Integer.valueOf(Integer.java:1081)
>   at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>   at java.base/java.io.OutputStream.write(OutputStream.java:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown
>  Source)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36900.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34284
[https://github.com/apache/spark/pull/34284]

> "SPARK-36464: size returns correct positive number even with over 2GB data" 
> will oom with JDK17 
> 
>
> Key: SPARK-36900
> URL: https://issues.apache.org/jira/browse/SPARK-36900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> Execute
>  
> {code:java}
> build/mvn clean install  -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite
> {code}
> with JDK 17, and the run aborts with an OutOfMemoryError:
> {code:java}
> ChunkedByteBufferOutputStreamSuite:
> - empty output
> - write a single byte
> - write a single near boundary
> - write a single at boundary
> - single chunk output
> - single chunk output at boundary size
> - multiple chunk output
> - multiple chunk output at boundary size
> *** RUN ABORTED ***
>   java.lang.OutOfMemoryError: Java heap space
>   at java.base/java.lang.Integer.valueOf(Integer.java:1081)
>   at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>   at java.base/java.io.OutputStream.write(OutputStream.java:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown
>  Source)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-36900:


Assignee: Yang Jie

> "SPARK-36464: size returns correct positive number even with over 2GB data" 
> will oom with JDK17 
> 
>
> Key: SPARK-36900
> URL: https://issues.apache.org/jira/browse/SPARK-36900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Execute
>  
> {code:java}
> build/mvn clean install  -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite
> {code}
> with JDK 17, and the run aborts with an OutOfMemoryError:
> {code:java}
> ChunkedByteBufferOutputStreamSuite:
> - empty output
> - write a single byte
> - write a single near boundary
> - write a single at boundary
> - single chunk output
> - single chunk output at boundary size
> - multiple chunk output
> - multiple chunk output at boundary size
> *** RUN ABORTED ***
>   java.lang.OutOfMemoryError: Java heap space
>   at java.base/java.lang.Integer.valueOf(Integer.java:1081)
>   at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>   at java.base/java.io.OutputStream.write(OutputStream.java:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown
>  Source)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36915) Pin actions to a full length commit SHA

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-36915:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> Pin actions to a full length commit SHA
> ---
>
> Key: SPARK-36915
> URL: https://issues.apache.org/jira/browse/SPARK-36915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Naveen S
>Assignee: Naveen S
>Priority: Minor
> Fix For: 3.3.0
>
>
> Pinning an action to a full-length commit SHA is currently the only way to 
> use an action as
> an immutable release. Pinning to a particular SHA helps mitigate the risk of 
> a bad actor adding a
> backdoor to the action's repository, as they would need to generate a SHA-1 
> collision for
> a valid Git object payload.
> [https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-third-party-actions]
> [https://github.com/ossf/scorecard/blob/main/docs/checks.md#pinned-dependencies]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36915) Pin actions to a full length commit SHA

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36915.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34163
[https://github.com/apache/spark/pull/34163]

> Pin actions to a full length commit SHA
> ---
>
> Key: SPARK-36915
> URL: https://issues.apache.org/jira/browse/SPARK-36915
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Naveen S
>Assignee: Naveen S
>Priority: Major
> Fix For: 3.3.0
>
>
> Pinning an action to a full-length commit SHA is currently the only way to 
> use an action as
> an immutable release. Pinning to a particular SHA helps mitigate the risk of 
> a bad actor adding a
> backdoor to the action's repository, as they would need to generate a SHA-1 
> collision for
> a valid Git object payload.
> [https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-third-party-actions]
> [https://github.com/ossf/scorecard/blob/main/docs/checks.md#pinned-dependencies]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36915) Pin actions to a full length commit SHA

2021-10-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-36915:


Assignee: Naveen S

> Pin actions to a full length commit SHA
> ---
>
> Key: SPARK-36915
> URL: https://issues.apache.org/jira/browse/SPARK-36915
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.2
>Reporter: Naveen S
>Assignee: Naveen S
>Priority: Major
>
> Pinning an action to a full-length commit SHA is currently the only way to 
> use an action as
> an immutable release. Pinning to a particular SHA helps mitigate the risk of 
> a bad actor adding a
> backdoor to the action's repository, as they would need to generate a SHA-1 
> collision for
> a valid Git object payload.
> [https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions#using-third-party-actions]
> [https://github.com/ossf/scorecard/blob/main/docs/checks.md#pinned-dependencies]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37024) Even more decimal overflow issues in average

2021-10-16 Thread Robert Joseph Evans (Jira)
Robert Joseph Evans created SPARK-37024:
---

 Summary: Even more decimal overflow issues in average
 Key: SPARK-37024
 URL: https://issues.apache.org/jira/browse/SPARK-37024
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Robert Joseph Evans


As a part of trying to accelerate the {{Decimal}} average aggregation on a 
[GPU|https://nvidia.github.io/spark-rapids/] I noticed a few issues around 
overflow. I think all of these can be fixed by replacing {{Average}} with 
explicit {{Sum}}, {{Count}}, and {{Divide}} operations for decimal instead of 
implicitly doing them. But the extra checks would come with a performance cost.
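
For illustration only, here is a hedged, user-level sketch of that idea 
(assuming a DataFrame {{df}} with a {{DecimalType}} column {{v}}); it is not 
the proposed internal rewrite of {{Average}}, but it shows how an explicit 
sum/count/divide is overflow-checked under ANSI mode while the summation inside 
{{AVG}} is not:

{code:scala}
// Sketch only: `df` is assumed to be a DataFrame with a DecimalType column `v`.
spark.conf.set("spark.sql.ansi.enabled", "true")

// Implicit: AVG performs the sum, count and divide internally.
val implicitAvg = df.selectExpr("AVG(v) AS a")

// Explicit: the standalone SUM raises on overflow under ANSI mode.
val explicitAvg = df
  .selectExpr("SUM(v) AS s", "COUNT(v) AS c")
  .selectExpr("s / c AS a")
{code}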

This is related to SPARK-35955, but goes quite a bit beyond it.
 # There are no ANSI overflow checks on the summation portion of average.
 # Nulls are inserted/overflow is detected on summation differently depending 
on code generation and parallelism.
 # If the input decimal precision is 11 or below, all overflow checks are 
disabled and the answer is wrong instead of null on overflow.

*Details:*

*There are no ANSI overflow checks on the summation portion.*
{code:scala}
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> spark.time(spark.range(201)
.repartition(2)
.selectExpr("id", "CAST('' AS DECIMAL(32, 
0)) as v")
.selectExpr("AVG(v)")
.show(truncate = false))
+--+
|avg(v)|
+--+
|null  |
+--+

Time taken: 622 ms

scala> spark.time(spark.range(201)
.repartition(2)
.selectExpr("id", "CAST('' AS DECIMAL(32, 
0)) as v")
.selectExpr("SUM(v)")
.show(truncate = false))
21/10/16 06:08:00 ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 19)
java.lang.ArithmeticException: Overflow in sum of decimals.
...
{code}
*Nulls are inserted on summation overflow differently depending on code 
generation and parallelism.*

Because there are no explicit overflow checks when doing the sum, a user can get 
very inconsistent results as to when a null is inserted on overflow. The checks 
really only take place when the {{Decimal}} value is converted and stored into 
an {{UnsafeRow}}. This happens when the values are shuffled, or after each 
operation if code gen is disabled. For a {{DECIMAL(32, 0)}} you can add 
1,000,000 max values before the summation overflows (the sum is carried as a 
{{DECIMAL(38, 0)}}, and 10^38 / 10^32 = 10^6).
{code:scala}
scala> spark.time(spark.range(100)
.coalesce(2)
.selectExpr("id", "CAST('' AS DECIMAL(32, 
0)) as v")
.selectExpr("SUM(v) as s", "COUNT(v) as c", "AVG(v) as a")
.selectExpr("s", "c", "s/c as sum_div_count", "a")
.show(truncate = false))
+--+---+---+-+
|s |c  |sum_div_count   |a|
+--+---+---+-+
|00|100|.00|.|
+--+---+---+-+
Time taken: 241 ms

scala> spark.time(spark.range(200)
.coalesce(2)
.selectExpr("id", "CAST('' AS DECIMAL(32, 
0)) as v")
.selectExpr("SUM(v) as s", "COUNT(v) as c", "AVG(v) as a")
.selectExpr("s", "c", "s/c as sum_div_count", "a")
.show(truncate = false))
++---+-+-+
|s   |c  |sum_div_count|a|
++---+-+-+
|null|200|null |.|
++---+-+-+
Time taken: 228 ms

scala> spark.time(spark.range(300)
.coalesce(2)
.selectExpr("id", "CAST('' AS DECIMAL(32, 
0)) as v")
.selectExpr("SUM(v) as s", "COUNT(v) as c", "AVG(v) as a")
.selectExpr("s", "c", "s/c as sum_div_count", "a")
.show(truncate = false))
++---+-++
|s   |c  |sum_div_count|a   |
++---+-++
|null|300|null |null|
++---+-++
Time taken: 347 ms

scala> spark.conf.set("spark.sql.codegen.wholeStage", "false")
scala> spark.time(spark.range(101)
.coalesce(2)
.selectExpr("id", "CAST('' AS DECIMAL(32, 
0)) as v")
.selectExpr("SUM(v) as s", "COUNT(v) as c", "AVG(v) as a")
.selectExpr("s", "c", "s/c as sum_div_count", "a")
.show(truncate = false))
++---+-++
|s   |c 

[jira] [Updated] (SPARK-37021) JDBC option "sessionInitStatement" is ignored in resolveTable

2021-10-16 Thread Valery Meleshkin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valery Meleshkin updated SPARK-37021:
-
Summary: JDBC option "sessionInitStatement" is ignored in resolveTable  
(was: JDBC option "sessionInitStatement" does not execute set sql statement 
when resolving a table)

> JDBC option "sessionInitStatement" is ignored in resolveTable
> -
>
> Key: SPARK-37021
> URL: https://issues.apache.org/jira/browse/SPARK-37021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Valery Meleshkin
>Priority: Major
>
> If {{sessionInitStatement}} is required to grant permissions or resolve an 
> ambiguity, schema resolution will fail when reading a JDBC table.
> Consider the following example running against Oracle database:
> {code:scala}
> reader.format("jdbc").options(
>   Map(
> "url" -> jdbcUrl,
> "dbtable" -> "FOO",
> "user" -> "BOB",
> "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR,
> "password" -> password
>   )).load
> {code}
> Table {{FOO}} is in schema {{BAR}}, but the default value of {{CURRENT_SCHEMA}} 
> for the JDBC connection will be {{BOB}}. Therefore, the code above will fail 
> with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). 
> It happens because [resolveTable 
> |https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67],
> which is called during the planning phase, ignores {{sessionInitStatement}}.
> This might sound like an artificial example; in a simple case like the one 
> above it's easy enough to specify {{"BAR.FOO"}} as {{dbtable}}. But when 
> {{sessionInitStatement}} contains a more complicated setup, it might not be as 
> straightforward.
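
For reference, a hedged workaround sketch for the simple case only (it assumes 
the same {{jdbcUrl}} and {{password}} values as above and does not help when 
{{sessionInitStatement}} does more than switch the schema): qualify the table 
with the schema that owns it, so resolution no longer depends on 
{{sessionInitStatement}} having run.

{code:scala}
// Workaround sketch: schema-qualify the table so resolveTable does not need
// sessionInitStatement to have been executed first.
val df = spark.read.format("jdbc").options(
  Map(
    "url"      -> jdbcUrl,
    "dbtable"  -> "BAR.FOO",
    "user"     -> "BOB",
    "password" -> password
  )).load()
{code}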



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37021) JDBC option "sessionInitStatement" does not execute set sql statement when resolving a table

2021-10-16 Thread Valery Meleshkin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valery Meleshkin updated SPARK-37021:
-
Description: 
If {{sessionInitStatement}} is required to grant permissions or resolve an 
ambiguity, schema resolution will fail when reading a JDBC table.

Consider the following example running against Oracle database:

{code:scala}
reader.format("jdbc").options(
  Map(
"url" -> jdbcUrl,
"dbtable" -> "FOO",
"user" -> "BOB",
"sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR,
"password" -> password
  )).load
{code}

Table {{FOO}} is in schema {{BAR}}, but the default value of {{CURRENT_SCHEMA}} 
for the JDBC connection will be {{BOB}}. Therefore, the code above will fail 
with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It 
happens because [resolveTable 
|https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67],
which is called during the planning phase, ignores {{sessionInitStatement}}.

This might sound like an artificial example; in a simple case like the one 
above it's easy enough to specify {{"BAR.FOO"}} as {{dbtable}}. But when 
{{sessionInitStatement}} contains a more complicated setup, it might not be as 
straightforward.

  was:
If {{sessionInitStatement}} is required to grant permissions or resolve an 
ambiguity, schema resolution will fail when reading a JDBC table.

Consider the following example running against Oracle database:

{code:scala}
reader.format("jdbc").options(
  Map(
"url" -> jdbcUrl,
"dbtable" -> "SELECT * FROM FOO",
"user" -> "BOB",
"sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR,
"password" -> password
  )).load
{code}

Table {{FOO}} is in schema {{BAR}}, but the default value of {{CURRENT_SCHEMA}} 
for the JDBC connection will be {{BOB}}. Therefore, the code above will fail 
with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). It 
happens because [resolveTable 
|https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67],
which is called during the planning phase, ignores {{sessionInitStatement}}.


> JDBC option "sessionInitStatement" does not execute set sql statement when 
> resolving a table
> 
>
> Key: SPARK-37021
> URL: https://issues.apache.org/jira/browse/SPARK-37021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Valery Meleshkin
>Priority: Major
>
> If {{sessionInitStatement}} is required to grant permissions or resolve an 
> ambiguity, schema resolution will fail when reading a JDBC table.
> Consider the following example running against Oracle database:
> {code:scala}
> reader.format("jdbc").options(
>   Map(
> "url" -> jdbcUrl,
> "dbtable" -> "FOO",
> "user" -> "BOB",
> "sessionInitStatement" -> """ALTER SESSION SET CURRENT_SCHEMA = "BAR,
> "password" -> password
>   )).load
> {code}
> Table {{FOO}} is in schema {{BAR}}, but the default value of {{CURRENT_SCHEMA}} 
> for the JDBC connection will be {{BOB}}. Therefore, the code above will fail 
> with an error ({{ORA-00942: table or view does not exist}} if it's Oracle). 
> It happens because [resolveTable 
> |https://github.com/apache/spark/blob/9d061e3939a021c602c070fc13cef951a8f94c82/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L67],
> which is called during the planning phase, ignores {{sessionInitStatement}}.
> This might sound like an artificial example; in a simple case like the one 
> above it's easy enough to specify {{"BAR.FOO"}} as {{dbtable}}. But when 
> {{sessionInitStatement}} contains a more complicated setup, it might not be as 
> straightforward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.

2021-10-16 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-37022:
---
Description: 
[{{black}}|https://github.com/psf/black] is a popular Python code formatter. It 
is used by a number of projects, both small and large, including prominent 
ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used to 
format {{pyspark.pandas}} and (though not enforced) the stub files.

We should consider using black to enforce formatting of all PySpark files. 
There are multiple reasons to do that:
 - Consistency: black is already used across the existing codebase, and 
black-formatted chunks of code have already been added to modules other than 
pyspark.pandas as a result of type hint inlining (SPARK-36845).
 - Lower cost of contributing and reviewing: Formatting can be automatically 
enforced and applied.
 - Simplify reviews: In general, black-formatted code produces small and 
highly readable diffs.
 - Reduce effort required to maintain patched forks: smaller diffs + 
predictable formatting.

Risks:
 - Initial reformatting requires quite significant changes.
 - Applying black will break blame in GitHub UI (for git in general see 
[Avoiding ruining git 
blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).

Additional steps:
 - To simplify backporting, black will have to be applied to all active 
branches.

  was:
[{{black}}|https://github.com/psf/black] is a popular Python code formatter. It 
is used by a number of projects, both small and large, including prominent 
ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used to 
format {{pyspark.pandas}} and (though not enforced) the stub files.

We should consider using black to enforce formatting of all PySpark files. 
There are multiple reasons to do that:

- Consistency: black is already used across the existing codebase, and 
black-formatted chunks of code have already been added to modules other than 
pyspark.pandas as a result of type hint inlining (SPARK-36845).
- Lower cost of contributing and reviewing: Formatting can be automatically 
enforced and applied.
- Simplify reviews: In general, black-formatted code produces small and highly 
readable diffs.

Risks:

- Initial reformatting requires quite significant changes.
- Applying black will break blame in GitHub UI (for git in general see 
[Avoiding ruining git 
blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).

Additional steps:

- To simplify backporting, black will have to be applied to all active branches.


> Use black as a formatter for the whole PySpark codebase.
> 
>
> Key: SPARK-37022
> URL: https://issues.apache.org/jira/browse/SPARK-37022
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
> Attachments: black-diff-stats.txt, pyproject.toml
>
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. 
> It is used by a number of projects, both small and large, including prominent 
> ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used 
> to format {{pyspark.pandas}} and (though not enforced) the stub files.
> We should consider using black to enforce formatting of all PySpark files. 
> There are multiple reasons to do that:
>  - Consistency: black is already used across the existing codebase, and 
> black-formatted chunks of code have already been added to modules other than 
> pyspark.pandas as a result of type hint inlining (SPARK-36845).
>  - Lower cost of contributing and reviewing: Formatting can be automatically 
> enforced and applied.
>  - Simplify reviews: In general, black-formatted code produces small and 
> highly readable diffs.
>  - Reduce effort required to maintain patched forks: smaller diffs + 
> predictable formatting.
> Risks:
>  - Initial reformatting requires quite significant changes.
>  - Applying black will break blame in GitHub UI (for git in general see 
> [Avoiding ruining git 
> blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
>  - To simplify backporting, black will have to be applied to all active 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.

2021-10-16 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429532#comment-17429532
 ] 

Maciej Szymkiewicz commented on SPARK-37022:


Looking at the diffs, the majority of changes will have minimal impact (magic 
trailing commas, switching to double quotes).

> Use black as a formatter for the whole PySpark codebase.
> 
>
> Key: SPARK-37022
> URL: https://issues.apache.org/jira/browse/SPARK-37022
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
> Attachments: black-diff-stats.txt, pyproject.toml
>
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. 
> It is used by a number of projects, both small and large, including prominent 
> ones, like pandas, scikit-learn, Django or SQLAlchemy. Black is already used 
> to format {{pyspark.pandas}} and (though not enforced) the stub files.
> We should consider using black to enforce formatting of all PySpark files. 
> There are multiple reasons to do that:
> - Consistency: black is already used across the existing codebase, and 
> black-formatted chunks of code have already been added to modules other than 
> pyspark.pandas as a result of type hint inlining (SPARK-36845).
> - Lower cost of contributing and reviewing: Formatting can be automatically 
> enforced and applied.
> - Simplify reviews: In general, black-formatted code produces small and 
> highly readable diffs.
> Risks:
> - Initial reformatting requires quite significant changes.
> - Applying black will break blame in GitHub UI (for git in general see 
> [Avoiding ruining git 
> blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
> - To simplify backporting, black will have to be applied to all active 
> branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-10-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36232.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34299
[https://github.com/apache/spark/pull/34299]

> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36232
> URL: https://issues.apache.org/jira/browse/SPARK-36232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
>  
> {code:java}
> >>> import decimal as d
> >>> import pyspark.pandas as ps
> >>> import numpy as np
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  True)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 0   1
> 1   2
> 2   None
> dtype: object
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  False)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...
> {code}
> As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
> when Arrow is disabled. We ought to fix that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36232) Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled

2021-10-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36232:


Assignee: Yikun Jiang

> Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled
> 
>
> Key: SPARK-36232
> URL: https://issues.apache.org/jira/browse/SPARK-36232
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
>  
> {code:java}
> >>> import decimal as d
> >>> import pyspark.pandas as ps
> >>> import numpy as np
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  True)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 0   1
> 1   2
> 2   None
> dtype: object
> >>> ps.utils.default_session().conf.set('spark.sql.execution.arrow.pyspark.enabled',
> >>>  False)
> >>> ps.Series([d.Decimal(1.0), d.Decimal(2.0), d.Decimal(np.nan)])
> 21/07/02 15:01:07 ERROR Executor: Exception in task 6.0 in stage 13.0 (TID 51)
> net.razorvine.pickle.PickleException: problem construction object: 
> java.lang.reflect.InvocationTargetException
> ...
> {code}
> As shown in the code above, we cannot create a Series with `Decimal('NaN')` 
> when Arrow is disabled. We ought to fix that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36230) hasnans for Series of Decimal(`NaN`)

2021-10-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36230.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34299
[https://github.com/apache/spark/pull/34299]

> hasnans for Series of Decimal(`NaN`)
> 
>
> Key: SPARK-36230
> URL: https://issues.apache.org/jira/browse/SPARK-36230
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>
> {code:java}
> >>> import pandas as pd
> >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')])
> >>> pser
> 0    0.1
> 1    NaN
> dtype: object
> >>> psser = ps.from_pandas(pser)
> >>> psser
> 0    0.1
> 1    None
> dtype: object
> >>> psser.hasnans
> False
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36230) hasnans for Series of Decimal(`NaN`)

2021-10-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36230:


Assignee: Yikun Jiang

> hasnans for Series of Decimal(`NaN`)
> 
>
> Key: SPARK-36230
> URL: https://issues.apache.org/jira/browse/SPARK-36230
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Yikun Jiang
>Priority: Major
>
> {code:java}
> >>> import pandas as pd
> >>> pser = pd.Series([Decimal('0.1'), Decimal('NaN')])
> >>> pser
> 0    0.1
> 1    NaN
> dtype: object
> >>> psser = ps.from_pandas(pser)
> >>> psser
> 0    0.1
> 1    None
> dtype: object
> >>> psser.hasnans
> False
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org