[jira] [Created] (SPARK-38404) Spark does not find CTE inside nested CTE

2022-03-02 Thread Joan Heredia Rius (Jira)
Joan Heredia Rius created SPARK-38404:
-

 Summary: Spark does not find CTE inside nested CTE
 Key: SPARK-38404
 URL: https://issues.apache.org/jira/browse/SPARK-38404
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.2.0
 Environment: Tested on:
 * macOS Monterey 12.2.1 (21D62)
 * python 3.9.10
 * pip 22.0.3
 * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and 3.1.3 
(SQL query works)
Reporter: Joan Heredia Rius


Hello! 

It seems that when CTEs are defined and then referenced inside another CTE in Spark 
SQL, Spark treats the inner reference as a table or view, fails to find it, and then 
errors with `Table or view not found: `
h3. Steps to reproduce
 # `pip install pyspark==3.2.0` (also happens with 3.2.1)
 # start pyspark console by typing `pyspark` in the terminal
 # Try to run the following SQL with `spark.sql(sql)`

 
{code:java}
WITH mock_cte__users AS (
  SELECT 1 AS id
),
model_under_test AS (
  WITH users AS (
    SELECT *
    FROM mock_cte__users
  )
  SELECT *
  FROM users
)
SELECT *
FROM model_under_test;{code}
Spark will fail with 

 
{code:java}
pyspark.sql.utils.AnalysisException: Table or view not found: mock_cte__users; 
line 8 pos 29; {code}
I don't know if this is a regression or an expected behavior of the new 3.2.* 
versions. This fix introduced in 3.2.0 might be related: 
https://issues.apache.org/jira/browse/SPARK-36447
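For convenience, here is the reproduction above as a single, self-contained PySpark 
snippet (a sketch that assumes a plain local SparkSession; the SQL is the same query 
as in the steps):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sql = """
WITH mock_cte__users AS (
  SELECT 1 AS id
),
model_under_test AS (
  WITH users AS (
    SELECT * FROM mock_cte__users
  )
  SELECT * FROM users
)
SELECT * FROM model_under_test
"""

# On pyspark 3.0.1 / 3.1.3 this returns a single row; on 3.2.0 / 3.2.1 it raises
# AnalysisException: Table or view not found: mock_cte__users
spark.sql(sql).show()
{code}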

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500556#comment-17500556
 ] 

Apache Spark commented on SPARK-38104:
--

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/35718

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> onto the use of error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38104:


Assignee: (was: Apache Spark)

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> onto the use of error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500555#comment-17500555
 ] 

Apache Spark commented on SPARK-38104:
--

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/35718

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> onto the use of error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38104:


Assignee: Apache Spark

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> onto the use of error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38403) `running-on-kubernetes` page render bad in v3.2.1(latest) website

2022-03-02 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500553#comment-17500553
 ] 

Kent Yao commented on SPARK-38403:
--

I guess it has been fixed via 
https://github.com/apache/spark/commit/3a179d762602243497c528bea3b3370e7548fa2d

> `running-on-kubernetes` page render bad in v3.2.1(latest) website
> -
>
> Key: SPARK-38403
> URL: https://issues.apache.org/jira/browse/SPARK-38403
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Yikun Jiang
>Priority: Major
>
> Looks like the `running-on-kubernetes` page encountered some problems when it was 
> published.
> [1] 
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties
> (You can see the bad formatting after #spark-properties.)
> - I also checked the master branch (with a local doc build) and v3.2.0 
> (https://spark.apache.org/docs/3.2.0/running-on-kubernetes.html); both render well.
> - But for the v3.2.1 tag, I couldn't install the doc dependencies due to a 
> dependency conflict.
> I'm not very familiar with the doc infra tooling. Can anyone help take a look?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38403) `running-on-kubernetes` page render bad in v3.2.1(latest) website

2022-03-02 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-38403:
---

 Summary: `running-on-kubernetes` page render bad in v3.2.1(latest) 
website
 Key: SPARK-38403
 URL: https://issues.apache.org/jira/browse/SPARK-38403
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.2.1
Reporter: Yikun Jiang


Looks like the `running-on-kubernetes` page encountered some problems when it was 
published.

[1] 
https://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties

(You can see the bad formatting after #spark-properties.)
- I also checked the master branch (with a local doc build) and v3.2.0 
(https://spark.apache.org/docs/3.2.0/running-on-kubernetes.html); both render well.
- But for the v3.2.1 tag, I couldn't install the doc dependencies due to a 
dependency conflict.
I'm not very familiar with the doc infra tooling. Can anyone help take a look?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38402) Improve user experience when working on data frames created from CSV and JSON in PERMISSIVE mode.

2022-03-02 Thread Dilip Biswal (Jira)
Dilip Biswal created SPARK-38402:


 Summary: Improve user experience when working on data frames 
created from CSV and JSON in PERMISSIVE mode.
 Key: SPARK-38402
 URL: https://issues.apache.org/jira/browse/SPARK-38402
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: Dilip Biswal


In our data processing pipeline, we first process the user-supplied data and 
eliminate invalid/corrupt records. So we parse JSON and CSV files in PERMISSIVE 
mode, where all the invalid records are captured in "_corrupt_record". We then 
apply predicates on "_corrupt_record" to eliminate the bad records before passing 
the good records further down the processing pipeline.

We encountered two issues:
1. The predicate pushdown introduced for CSV does not take into account this 
system-generated "_corrupt_record" column and tries to push it down to the scan, 
resulting in an exception because the column is not part of the base schema. 
2. Applying predicates on "_corrupt_record" results in an AnalysisException like 
the following.
{code:java}
val schema = new StructType()
  .add("id", IntegerType, true)
  .add("weight", IntegerType, true) // The weight field is defined wrongly. The actual data contains floating point numbers, while the schema specifies an integer.
  .add("price", IntegerType, true)
  .add("_corrupt_record", StringType, true) // The schema contains a special column _corrupt_record, which does not exist in the data. This column captures rows that did not parse correctly.

val csv_with_wrong_schema = spark.read.format("csv")
  .option("header", "true")
  .schema(schema)
  .load("/FileStore/tables/csv_corrupt_record.csv")

val badRows = csv_with_wrong_schema.filter($"_corrupt_record".isNotNull)
val numBadRows = badRows.count()

// Resulting AnalysisException:
// Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
// referenced columns only include the internal corrupt record column
// (named _corrupt_record by default). For example:
// spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
// and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
// Instead, you can cache or save the parsed results and then send the same query.
// For example, val df = spark.read.schema(schema).csv(file).cache() and then
// df.filter($"_corrupt_record".isNotNull).count().
{code}

For (1), we have disabled predicate pushdown.
For (2), we currently cache the data frame before using it; however, it's not 
convenient and we would like to see a better user experience.
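For reference, a minimal PySpark sketch of the cache-first workaround we use today 
for (2), using the same (intentionally wrong) schema and placeholder path as the 
Scala example above:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# Same schema as above, including the special _corrupt_record column.
schema = (StructType()
          .add("id", IntegerType(), True)
          .add("weight", IntegerType(), True)
          .add("price", IntegerType(), True)
          .add("_corrupt_record", StringType(), True))

# Caching the parsed result first is the workaround the AnalysisException suggests.
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("/FileStore/tables/csv_corrupt_record.csv")
      .cache())

bad_rows = df.filter(col("_corrupt_record").isNotNull())
good_rows = df.filter(col("_corrupt_record").isNull())
print(bad_rows.count(), good_rows.count())
{code}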
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38401) Unify get preferred locations for shuffle in AQE

2022-03-02 Thread XiDuo You (Jira)
XiDuo You created SPARK-38401:
-

 Summary: Unify get preferred locations for shuffle in AQE
 Key: SPARK-38401
 URL: https://issues.apache.org/jira/browse/SPARK-38401
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


There are several issues in the method `getPreferredLocations` of `ShuffledRowRDD`:
 * It does not respect the config `spark.shuffle.reduceLocality.enabled`, so we 
cannot disable reduce locality.
 * It does not respect `REDUCER_PREF_LOCS_FRACTION`, so the preference has no effect 
when the DAG scheduler assigns a task to an executor that holds less of the data. 
Worse, the driver takes more memory to store the useless locations.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38104) Use error classes in the parsing errors of windows

2022-03-02 Thread Yuto Akutsu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500535#comment-17500535
 ] 

Yuto Akutsu commented on SPARK-38104:
-

I'm working on this.

> Use error classes in the parsing errors of windows
> --
>
> Key: SPARK-38104
> URL: https://issues.apache.org/jira/browse/SPARK-38104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * repetitiveWindowDefinitionError
> * invalidWindowReferenceError
> * cannotResolveWindowReferenceError
> onto the use of error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38400) Enable Series.rename to change index labels

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500530#comment-17500530
 ] 

Apache Spark commented on SPARK-38400:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/35717

> Enable Series.rename to change index labels
> ---
>
> Key: SPARK-38400
> URL: https://issues.apache.org/jira/browse/SPARK-38400
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Enable Series.rename to change index labels when a function is passed as the `index` input.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38400) Enable Series.rename to change index labels

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38400:


Assignee: (was: Apache Spark)

> Enable Series.rename to change index labels
> ---
>
> Key: SPARK-38400
> URL: https://issues.apache.org/jira/browse/SPARK-38400
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Enable Series.rename to change index labels when a function is passed as the `index` input.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38400) Enable Series.rename to change index labels

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38400:


Assignee: Apache Spark

> Enable Series.rename to change index labels
> ---
>
> Key: SPARK-38400
> URL: https://issues.apache.org/jira/browse/SPARK-38400
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Enable Series.rename to change index labels when a function is passed as the `index` input.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38398) Add `priorityClassName` integration test case

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500531#comment-17500531
 ] 

Apache Spark commented on SPARK-38398:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35716

> Add `priorityClassName` integration test case
> -
>
> Key: SPARK-38398
> URL: https://issues.apache.org/jira/browse/SPARK-38398
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38383) Support APP_ID and EXECUTOR_ID placeholder in annotations

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38383:
--
Parent: SPARK-36057
Issue Type: Sub-task  (was: Improvement)

> Support APP_ID and EXECUTOR_ID placeholder in annotations
> -
>
> Key: SPARK-38383
> URL: https://issues.apache.org/jira/browse/SPARK-38383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38400) Enable Series.rename to change index labels

2022-03-02 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38400:


 Summary: Enable Series.rename to change index labels
 Key: SPARK-38400
 URL: https://issues.apache.org/jira/browse/SPARK-38400
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Enable Series.rename to change index labels when a function is passed as the `index` input.
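For illustration, a sketch of the plain pandas behaviour this improvement would align 
with (this is pandas itself, not the pandas-on-Spark implementation):
{code:python}
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# A callable or dict-like passed as `index` is applied to the index labels,
# whereas a scalar would instead change the Series name.
print(s.rename(str.upper).index.tolist())   # ['A', 'B', 'C']
print(s.rename({"a": "x"}).index.tolist())  # ['x', 'b', 'c']
{code}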



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38398) Add `priorityClassName` integration test case

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38398:
-

Assignee: Dongjoon Hyun

> Add `priorityClassName` integration test case
> -
>
> Key: SPARK-38398
> URL: https://issues.apache.org/jira/browse/SPARK-38398
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38399) Why doesn't shuffle hash join support build left table for left-outer-join but full-outer-join?

2022-03-02 Thread mengdou (Jira)
mengdou created SPARK-38399:
---

 Summary: Why doesn't shuffle hash join support build left table 
for left-outer-join but full-outer-join?
 Key: SPARK-38399
 URL: https://issues.apache.org/jira/browse/SPARK-38399
 Project: Spark
  Issue Type: Question
  Components: Spark Core, SQL
Affects Versions: 3.2.0
Reporter: mengdou


Why doesn't shuffle hash join support building the left table for a left outer 
join, while it does support building the right table for a full outer join?

!image-2022-03-03-13-53-50-520.png!

 

IMO, if the left table is the build table then, similarly to the full outer join 
case, we can first create a BitSet to record which build rows have matched, then 
iterate over all rows from the stream-side iterator and look up each hash key in 
the hash relation.

Once the stream side is exhausted, we iterate over the hash relation and check 
the BitSet: the build rows that never matched are exactly the left-outer rows to 
emit.
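To make the idea concrete, a rough and purely illustrative Python sketch of that 
approach (not Spark's actual shuffled hash join code):
{code:python}
def left_outer_join_build_left(build_rows, stream_rows, key):
    """Left outer join where the LEFT side is the build side.

    build_rows are the left-side rows, stream_rows the right-side rows,
    and key extracts the join key from a row.
    """
    # Build the hash relation from the left side and track, per build row,
    # whether it ever matched -- the BitSet described above.
    relation = {}
    for i, row in enumerate(build_rows):
        relation.setdefault(key(row), []).append((i, row))
    matched = [False] * len(build_rows)

    # Probe with the streamed (right) side, emitting matched pairs.
    for srow in stream_rows:
        for i, brow in relation.get(key(srow), []):
            matched[i] = True
            yield (brow, srow)

    # Build rows that never matched are exactly the left-outer rows.
    for i, brow in enumerate(build_rows):
        if not matched[i]:
            yield (brow, None)
{code}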

 

Can anyone help? Thanks!

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38398) Add `priorityClassName` integration test case

2022-03-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38398:
-

 Summary: Add `priorityClassName` integration test case
 Key: SPARK-38398
 URL: https://issues.apache.org/jira/browse/SPARK-38398
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes, Tests
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38334) Implement support for DEFAULT values for columns in tables

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38334:
--
Affects Version/s: 3.3.0
   (was: 3.2.1)

> Implement support for DEFAULT values for columns in tables 
> ---
>
> Key: SPARK-38334
> URL: https://issues.apache.org/jira/browse/SPARK-38334
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Daniel
>Priority: Major
>
> This story tracks the implementation of DEFAULT values for columns in tables.
> CREATE TABLE and ALTER TABLE invocations will support setting column default 
> values for future operations. Subsequent INSERT, UPDATE, and MERGE statements may 
> then reference the value using the DEFAULT keyword as needed.
> Examples:
> {code:sql}
> CREATE TABLE T(a INT, b INT NOT NULL);
> -- The default default is NULL
> INSERT INTO T VALUES (DEFAULT, 0);
> INSERT INTO T(b)  VALUES (1);
> SELECT * FROM T;
> (NULL, 0)
> (NULL, 1)
> -- Adding a default to a table with rows, sets the values for the
> -- existing rows (exist default) and new rows (current default).
> ALTER TABLE T ADD COLUMN c INT DEFAULT 5;
> INSERT INTO T VALUES (1, 2, DEFAULT);
> SELECT * FROM T;
> (NULL, 0, 5)
> (NULL, 1, 5)
> (1, 2, 5) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38334) Implement support for DEFAULT values for columns in tables

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38334:
--
Issue Type: Improvement  (was: Story)

> Implement support for DEFAULT values for columns in tables 
> ---
>
> Key: SPARK-38334
> URL: https://issues.apache.org/jira/browse/SPARK-38334
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Daniel
>Priority: Major
>
> This story tracks the implementation of DEFAULT values for columns in tables.
> CREATE TABLE and ALTER TABLE invocations will support setting column default 
> values for future operations. Subsequent INSERT, UPDATE, and MERGE statements may 
> then reference the value using the DEFAULT keyword as needed.
> Examples:
> {code:sql}
> CREATE TABLE T(a INT, b INT NOT NULL);
> -- The default default is NULL
> INSERT INTO T VALUES (DEFAULT, 0);
> INSERT INTO T(b)  VALUES (1);
> SELECT * FROM T;
> (NULL, 0)
> (NULL, 1)
> -- Adding a default to a table with rows, sets the values for the
> -- existing rows (exist default) and new rows (current default).
> ALTER TABLE T ADD COLUMN c INT DEFAULT 5;
> INSERT INTO T VALUES (1, 2, DEFAULT);
> SELECT * FROM T;
> (NULL, 0, 5)
> (NULL, 1, 5)
> (1, 2, 5) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38395) Pyspark issue in resolving column when there is dot (.)

2022-03-02 Thread Krishna Sangeeth KS (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Sangeeth KS updated SPARK-38395:

Description: 
PySpark's applyInPandas has some difficulty resolving columns when there is a 
dot in the column name. 

Here is an example that reproduces the issue, adapted from the doctest example 
[here|https://github.com/apache/spark/blob/branch-3.0/python/pyspark/sql/pandas/group_ops.py#L237-L248]
{code:python}
df1 = spark.createDataFrame(
[(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 
4.0)],
("abc|database|10.159.154|xef", "id", "v1"))
df2 = spark.createDataFrame(
[(2101, 1, "x"), (2101, 2, "y")],
("abc|database|10.159.154|xef", "id", "v2"))
def asof_join(l, r):
return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id")
df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 
string").show(){code}
This gives the below error
{code:python}
AnalysisException Traceback (most recent call last)
 in 
  8 return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", 
by="id")
  9 df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
---> 10 asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 
double, v2 string").show()

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in applyInPandas(self, func, schema)
295 udf = pandas_udf(
296 func, returnType=schema, 
functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF)
--> 297 all_cols = self._extract_cols(self._gd1) + 
self._extract_cols(self._gd2)
298 udf_column = udf(*all_cols)
299 jdf = self._gd1._jgd.flatMapCoGroupsInPandas(self._gd2._jgd, 
udf_column._jc.expr())

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in _extract_cols(gd)
303 def _extract_cols(gd):
304 df = gd._df
--> 305 return [df[col] for col in df.columns]
306 
307 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in (.0)
303 def _extract_cols(gd):
304 df = gd._df
--> 305 return [df[col] for col in df.columns]
306 
307 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/dataframe.py in 
__getitem__(self, item)
   1378 """
   1379 if isinstance(item, basestring):
-> 1380 jc = self._jdf.apply(item)
   1381 return Column(jc)
   1382 elif isinstance(item, Column):

~/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py in 
__call__(self, *args)
   1303 answer = self.gateway_client.send_command(command)
   1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307 for temp_arg in temp_args:

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in 
deco(*a, **kw)
135 # Hide where the exception came from that shows a 
non-Pythonic
136 # JVM exception message.
--> 137 raise_from(converted)
138 else:
139 raise

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in 
raise_from(e)

AnalysisException: Cannot resolve column name "abc|database|10.159.154|xef" 
among (abc|database|10.159.154|xef, id, v1); did you mean to quote the 
`abc|database|10.159.154|xef` column?;

{code}
As we can see, the column is present in the "among" list of the error message.

When I replace `.` (dot) with `_` (underscore), the code actually works.
{code:python}
df1 = spark.createDataFrame(
[(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 
4.0)],
("abc|database|10_159_154|xef", "id", "v1"))
df2 = spark.createDataFrame(
[(2101, 1, "x"), (2101, 2, "y")],
("abc|database|10_159_154|xef", "id", "v2"))
def asof_join(l, r):
return pd.merge_asof(l, r, on="abc|database|10_159_154|xef", by="id")
df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="`abc|database|10_159_154|xef` int, id int, v1 double, v2 
string").show()
{code}
{code:java}
+---+---+---+---+
|abc|database|10_159_154|xef| id| v1| v2|
+---+---+---+---+
|   2101|  1|1.0|  x|
|   2102|  1|3.0|  x|
|   2101|  2|2.0|  y|
|   2102|  2|4.0|  y|
+---+---+---+---+
{code}

  was:
Pyspark apply in pandas have some difficult in resolving columns when there is 
dot in the column name. 

Here is an example that I have which reproduces the issue.  Example taken by 
modifying doctest example 
[here|https://github.com/apache/spark/blob/branch-3.0/pytho

[jira] [Commented] (SPARK-38340) Upgrade protobuf-java from 2.5.0 to 3.16.1

2022-03-02 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500487#comment-17500487
 ] 

Yang Jie commented on SPARK-38340:
--

cc [~hyukjin.kwon], do you know how GA tests code that hasn't been merged with the 
hadoop-2 profile?

 

> Upgrade protobuf-java from 2.5.0 to 3.16.1
> --
>
> Key: SPARK-38340
> URL: https://issues.apache.org/jira/browse/SPARK-38340
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
>  
> [CVE-2021-22569|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-22569]
> To do this upgrade I have done
> external/kinesis-asl-assembly/pom.xml change line 65 to 
> 3.16.1 
> pom.xml change line 124 to 3.16.1
> run 
> ./dev/test-dependencies.sh --replace-manifest 
> which changes 
> dev/deps/spark-deps-hadoop-2-hive-2.3 line 235 to 
> protobuf-java/3.16.1//protobuf-java-3.16.1.jar
> and 
> dev/deps/spark-deps-hadoop-3-hive-2.3 to 
> protobuf-java/3.16.1//protobuf-java-3.16.1.jar 
> My branch 
> [protobuf-java-from-2.5.0-to-3.16.1|https://github.com/bjornjorgensen/spark/tree/protobuf-java-from-2.5.0-to-3.16.1]
>  is OK with tests, but when I run 
> ./build/mvn -DskipTests clean package && ./build/mvn -e package
>  
> I get this error:
> 01:01:41.381 WARN 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite: 
> = POSSIBLE THREAD LEAK IN SUITE 
> o.a.s.sql.execution.datasources.orc.OrcColumnarBatchReaderSuite, threads: 
> rpc-boss-3348-1 (daemon=true), shuffle-boss-3351-1 (daemon=true) =
> Run completed in 1 hour, 7 minutes, 35 seconds.
> Total number of tests run: 11260
> Suites: completed 505, aborted 0
> Tests: succeeded 11259, failed 1, canceled 5, ignored 57, pending 0
> *** 1 TEST FAILED ***
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.3.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  3.396 
> s]
> [INFO] Spark Project Tags . SUCCESS [  7.374 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [  9.324 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  4.097 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 47.468 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 10.478 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  2.425 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  2.767 
> s]
> [INFO] Spark Project Core . SUCCESS [30:56 
> min]
> [INFO] Spark Project ML Local Library . SUCCESS [ 29.105 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [02:09 
> min]
> [INFO] Spark Project Streaming  SUCCESS [05:21 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [08:15 
> min]
> [INFO] Spark Project SQL .. FAILURE [  01:11 
> h]
> [INFO] Spark Project ML Library ... SKIPPED
> [INFO] Spark Project Tools  SKIPPED
> [INFO] Spark Project Hive . SKIPPED
> [INFO] Spark Project REPL . SKIPPED
> [INFO] Spark Project Assembly . SKIPPED
> [INFO] Kafka 0.10+ Token Provider for Streaming ... SKIPPED
> [INFO] Spark Integration for Kafka 0.10 ... SKIPPED
> [INFO] Kafka 0.10+ Source for Structured Streaming  SKIPPED
> [INFO] Spark Project Examples . SKIPPED
> [INFO] Spark Integration for Kafka 0.10 Assembly .. SKIPPED
> [INFO] Spark Avro . SKIPPED
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time:  02:00 h
> [INFO] Finished at: 2022-02-27T01:01:44+01:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project 
> spark-sql_2.12: There are test failures -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
> goal org.scalatest:scalatest-maven-plugin:2.0.2:test (test) on project 
> spark-sql_2.12: There are test failures
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:215)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor

[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38397:
--
Description: 
There are several ways to run Spark on K8s including vanilla `spark-submit` 
with built-in  `KubernetesClusterManager`, `spark-submit` with custom 
`ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), 
custom K8s `schedulers`, custom `standalone pod definitions`, and so on.

This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]
{code}
metadata:
  generateName: sample-job-
  annotations:
kueue.k8s.io/queue-name: main
{code}

The best case is Apache Spark users use it in the future via pod templates or 
existing configuration. In other words, we don't need to do anything and close 
this JIRA without any patches.
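As a purely illustrative sketch of the "existing configuration" path, the queue 
annotation from the snippet above could presumably be attached today through Spark's 
generic driver/executor pod annotation configs; whether Kueue then treats the 
Spark-created pods as queued workloads is exactly what this issue needs to verify:
{code:python}
from pyspark.sql import SparkSession

# Illustration only: reuse the kueue.k8s.io/queue-name annotation from the
# example above via Spark's generic K8s annotation settings.
spark = (SparkSession.builder
         .config("spark.kubernetes.driver.annotation.kueue.k8s.io/queue-name", "main")
         .config("spark.kubernetes.executor.annotation.kueue.k8s.io/queue-name", "main")
         .getOrCreate())
{code}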

  was:
This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]
{code}
metadata:
  generateName: sample-job-
  annotations:
kueue.k8s.io/queue-name: main
{code}

The best case is Apache Spark users use it in the future via pod templates or 
existing configuration. In other words, we don't need to do anything and close 
this JIRA without any patches.


> Support Kueue: K8s-native Job Queueing
> --
>
> Key: SPARK-38397
> URL: https://issues.apache.org/jira/browse/SPARK-38397
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> There are several ways to run Spark on K8s including vanilla `spark-submit` 
> with built-in  `KubernetesClusterManager`, `spark-submit` with custom 
> `ExternalClusterManager`, CRD-based operators (like spark-on-k8s-operator), 
> custom K8s `schedulers`, custom `standalone pod definitions`, and so on.
> This issue is tracking K8s-native Job Queueing related work.
>  * [https://github.com/kubernetes-sigs/kueue]
> {code}
> metadata:
>   generateName: sample-job-
>   annotations:
> kueue.k8s.io/queue-name: main
> {code}
> The best case is Apache Spark users use it in the future via pod templates or 
> existing configuration. In other words, we don't need to do anything and 
> close this JIRA without any patches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38397:
--
Description: 
This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]
{code}
metadata:
  generateName: sample-job-
  annotations:
kueue.k8s.io/queue-name: main
{code}

The best case is Apache Spark users use it in the future via pod templates or 
existing configuration. In other words, we don't need to do anything and close 
this JIRA without any patches.

  was:
This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]

The best case is Apache Spark users use it in the future via pod templates. In 
other words, we can close this JIRA as `Done` without any patches.


> Support Kueue: K8s-native Job Queueing
> --
>
> Key: SPARK-38397
> URL: https://issues.apache.org/jira/browse/SPARK-38397
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue is tracking K8s-native Job Queueing related work.
>  * [https://github.com/kubernetes-sigs/kueue]
> {code}
> metadata:
>   generateName: sample-job-
>   annotations:
> kueue.k8s.io/queue-name: main
> {code}
> The best case is Apache Spark users use it in the future via pod templates or 
> existing configuration. In other words, we don't need to do anything and 
> close this JIRA without any patches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38397) Support Kueue: K8s-native Job Queueing

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38397:
--
Description: 
This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]

The best case is Apache Spark users use it in the future via pod templates. In 
other words, we can close this JIRA as `Done` without any patches.

  was:
This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]

The best case is Apache Spark users use it in the future via pod templates.


> Support Kueue: K8s-native Job Queueing
> --
>
> Key: SPARK-38397
> URL: https://issues.apache.org/jira/browse/SPARK-38397
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue is tracking K8s-native Job Queueing related work.
>  * [https://github.com/kubernetes-sigs/kueue]
> The best case is Apache Spark users use it in the future via pod templates. 
> In other words, we can close this JIRA as `Done` without any patches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38397) Support Kueue: K8s-native Job Queueing

2022-03-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38397:
-

 Summary: Support Kueue: K8s-native Job Queueing
 Key: SPARK-38397
 URL: https://issues.apache.org/jira/browse/SPARK-38397
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun


This issue is tracking K8s-native Job Queueing related work.
 * [https://github.com/kubernetes-sigs/kueue]

The best case is Apache Spark users use it in the future via pod templates.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38310) Support job queue in YuniKorn feature step

2022-03-02 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500448#comment-17500448
 ] 

Wilfred Spiegelenburg commented on SPARK-38310:
---

Spark on YARN uses the --queue option as part of the submit. Should the same 
option work for Spark on K8s when used in combination with YuniKorn or Volcano 
as the scheduler? Driver pod annotations and labels support the queue 
specification on the YuniKorn side; Volcano has a similar approach through the 
pod group.

It would be nice to be able to leverage the existing command-line option to 
submit to a specific queue.

> Support job queue in YuniKorn feature step
> --
>
> Key: SPARK-38310
> URL: https://issues.apache.org/jira/browse/SPARK-38310
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: Like SPARK-38188, yunikorn needs to support the queue 
> property and use that in the feature step to properly configure 
> driver/executor pods.
>Reporter: Weiwei Yang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37753) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer with join many empty partitions on the left

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500446#comment-17500446
 ] 

Apache Spark commented on SPARK-37753:
--

User 'ekoifman' has created a pull request for this issue:
https://github.com/apache/spark/pull/35715

> DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer 
> with join many empty partitions on the left
> -
>
> Key: SPARK-37753
> URL: https://issues.apache.org/jira/browse/SPARK-37753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: Eugene Koifman
>Priority: Major
>
> if {{DynamicJoinSelection}} sees a LOJ where there are many empty partitions 
> on the left, we want to run it as a shuffle join so that tasks with empty LHS 
> short circuit quickly.
> So we should inhibit Broadcast join.
> This is a follow up to SPARK-37193



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37753) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer with join many empty partitions on the left

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37753:


Assignee: Apache Spark

> DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer 
> with join many empty partitions on the left
> -
>
> Key: SPARK-37753
> URL: https://issues.apache.org/jira/browse/SPARK-37753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: Eugene Koifman
>Assignee: Apache Spark
>Priority: Major
>
> if {{DynamicJoinSelection}} sees a LOJ where there are many empty partitions 
> on the left, we want to run it as a shuffle join so that tasks with empty LHS 
> short circuit quickly.
> So we should inhibit Broadcast join.
> This is a follow up to SPARK-37193



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37753) DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer with join many empty partitions on the left

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37753:


Assignee: (was: Apache Spark)

> DynamicJoinSelection.shouldDemoteBroadcastHashJoin should apply to left outer 
> with join many empty partitions on the left
> -
>
> Key: SPARK-37753
> URL: https://issues.apache.org/jira/browse/SPARK-37753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8
>Reporter: Eugene Koifman
>Priority: Major
>
> if {{DynamicJoinSelection}} sees a LOJ where there are many empty partitions 
> on the left, we want to run it as a shuffle join so that tasks with empty LHS 
> short circuit quickly.
> So we should inhibit Broadcast join.
> This is a follow up to SPARK-37193



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38191) The staging directory of write job only needs to be initialized once in HadoopMapReduceCommitProtocol.

2022-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38191.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35693
[https://github.com/apache/spark/pull/35693]

> The staging directory of write job only needs to be initialized once in 
> HadoopMapReduceCommitProtocol.
> --
>
> Key: SPARK-38191
> URL: https://issues.apache.org/jira/browse/SPARK-38191
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1
>Reporter: weixiuli
>Assignee: weixiuli
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37895) Error while joining two tables with non-english field names

2022-03-02 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500419#comment-17500419
 ] 

Pablo Langa Blanco commented on SPARK-37895:


I'm working on a fix

> Error while joining two tables with non-english field names
> ---
>
> Key: SPARK-37895
> URL: https://issues.apache.org/jira/browse/SPARK-37895
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Marina Krasilnikova
>Priority: Minor
>
> While trying to join two tables with non-english field names in postgresql 
> with query like
> "select view1.`Имя1` , view1.`Имя2`, view2.`Имя3` from view1 left join  view2 
> on view1.`Имя2`=view2.`Имя4`"
> we get an error which says that there is no field "`Имя4`" (field name is 
> surrounded by backticks).
> It appears that to get the data from the second table it constructs query like
> SELECT "Имя3","Имя4" FROM "public"."tab2"  WHERE ("`Имя4`" IS NOT NULL) 
> and these backticks are redundant in WHERE clause.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Description: 
This JIRA aims to improve K8s integration tests for the following recent 
activity.
 - Java 17
 - Cloud K8s testing (including Graviton instances and K8s 1.22 GA on EKS on 
March 15th)
 - New K8s model (K8s clients)
 - New custom schedulers and statefulset

  was:
This JIRA aims to improve K8s integration tests for the following recent 
activity.
- Java 17
- Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
- New K8s model (K8s clients)
- New custom schedulers and statefulset


> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: William Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This JIRA aims to improve K8s integration tests for the following recent 
> activity.
>  - Java 17
>  - Cloud K8s testing (including Graviton instances and K8s 1.22 GA on EKS on 
> March 15th)
>  - New K8s model (K8s clients)
>  - New custom schedulers and statefulset



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38081) Support cloud-backend in K8s IT with SBT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38081:
--
Fix Version/s: 3.2.2

> Support cloud-backend in K8s IT with SBT
> 
>
> Key: SPARK-38081
> URL: https://issues.apache.org/jira/browse/SPARK-38081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes

2022-03-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500416#comment-17500416
 ] 

Dongjoon Hyun edited comment on SPARK-38379 at 3/2/22, 10:54 PM:
-

Could you describe your procedure for this, [~tgraves]? I've never run 
`spark-shell` on a K8s cluster so far.
bq. and starting a spark-shell in client mode


was (Author: dongjoon):
Could you describe your procedure about this, [~tgraves]? I've been never 
running `spark-shell` on K8s cluster so far.
> and starting a spark-shell in client mode

> Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes 
> --
>
> Key: SPARK-38379
> URL: https://issues.apache.org/jira/browse/SPARK-38379
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in 
> client mode.  I'm using persistent local volumes to mount nvme under /data in 
> the executors and on startup the driver always throws the warning below.
> using these options:
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
>  
>  
> {code:java}
> 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when 
> notifying snapshot subscriber.
> java.util.NoSuchElementException: spark.app.id
>         at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.SparkConf.get(SparkConf.scala:245)
>         at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>         at scala.collection.immutable.List.foldLeft(List.scala:91)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
>         at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>         at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>         at

[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes

2022-03-02 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500416#comment-17500416
 ] 

Dongjoon Hyun commented on SPARK-38379:
---

Could you describe your procedure for this, [~tgraves]? I've never run 
`spark-shell` on a K8s cluster so far.
> and starting a spark-shell in client mode

> Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes 
> --
>
> Key: SPARK-38379
> URL: https://issues.apache.org/jira/browse/SPARK-38379
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in 
> client mode.  I'm using persistent local volumes to mount nvme under /data in 
> the executors and on startup the driver always throws the warning below.
> using these options:
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
>  
>  
> {code:java}
> 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when 
> notifying snapshot subscriber.
> java.util.NoSuchElementException: spark.app.id
>         at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.SparkConf.get(SparkConf.scala:245)
>         at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>         at scala.collection.immutable.List.foldLeft(List.scala:91)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
>         at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>         at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117)
>  
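
Going by the stack trace above, `MountVolumesFeatureStep` ends up calling `SparkConf.getAppId`, which throws `NoSuchElementException` whenever `spark.app.id` has not yet been set on the driver conf — something that can plausibly happen in client mode while the executor-pods snapshot subscriber fires during startup. A minimal sketch of a defensive read, using only the public `SparkConf` API (`getOption` does exist; the fallback string is a hypothetical placeholder for illustration only):

{code:java}
import org.apache.spark.SparkConf

// Sketch only, not Spark source: read the app id without throwing when it is unset.
val conf = new SparkConf()
val appId: String = conf
  .getOption("spark.app.id")            // None until SparkContext sets it during startup
  .getOrElse("spark-app-id-pending")    // hypothetical placeholder, e.g. for labeling only
{code}

Whether the real fix should wait for the id, pass it through `KubernetesConf`, or tolerate its absence is a design question for the PR; the sketch only illustrates the non-throwing access pattern.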

[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark

2022-03-02 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-38353:

Description: 
Create the ticket since instrumenting __enter__ and __exit__ magic 
methods for pandas API on Spark can help improve accuracy of the usage data. 
Besides, we are interested in extending the pandas-on-Spark usage logger to 
other PySpark modules in the future so it will help improve accuracy of usage 
data of other PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas-on-Spark usage logger records the internal call 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

  was:
Create the ticket since instrumenting __enter__ and __exit__ magic methods for 
pandas API on Spark can help improve accuracy of the usage data. Besides, we 
are interested in extending the pandas-on-Spark usage logger to other PySpark 
modules in the future so it will help improve accuracy of usage data of other 
PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records the internal call 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.


> Instrument __enter__ and __exit__ magic methods for pandas API on Spark
> ---
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> Create the ticket since instrumenting __enter__ and __exit__ magic 
> methods for pandas API on Spark can help improve accuracy of the usage data. 
> Besides, we are interested in extending the pandas-on-Spark usage logger to 
> other PySpark modules in the future so it will help improve accuracy of usage 
> data of other PySpark modules.
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas-on-Spark usage logger records the internal call 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since __enter__ and __exit__ methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark

2022-03-02 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-38353:

Description: 
Create the ticket since instrumenting __enter__ and __exit__ magic methods for 
pandas API on Spark can help improve accuracy of the usage data. Besides, we 
are interested in extending the pandas-on-Spark usage logger to other PySpark 
modules in the future so it will help improve accuracy of usage data of other 
PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records the internal call 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

  was:
Create the ticket since instrumenting __enter__ and 
__exit__ magic methods for Pandas module can help improve accuracy 
of the usage data. Besides, we are interested in extending the Pandas usage 
logger to other PySpark modules in the future so it will help improve accuracy 
of usage data of other PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records the internal call 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.


> Instrument __enter__ and __exit__ magic methods for pandas API on Spark
> ---
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> Create the ticket since instrumenting __enter__ and __exit__ magic methods 
> for pandas API on Spark can help improve accuracy of the usage data. Besides, 
> we are interested in extending the pandas-on-Spark usage logger to other 
> PySpark modules in the future so it will help improve accuracy of usage data 
> of other PySpark modules.
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records the internal call 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since __enter__ and __exit__ methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for pandas API on Spark

2022-03-02 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-38353:

Summary: Instrument __enter__ and __exit__ magic methods for pandas API on 
Spark  (was: Instrument __enter__ and __exit__ magic methods for Pandas module)

> Instrument __enter__ and __exit__ magic methods for pandas API on Spark
> ---
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> Create the ticket since instrumenting __enter__ and 
> __exit__ magic methods for Pandas module can help improve 
> accuracy of the usage data. Besides, we are interested in extending the 
> Pandas usage logger to other PySpark modules in the future so it will help 
> improve accuracy of usage data of other PySpark modules.
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records the internal call 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since __enter__ and __exit__ methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Description: 
This JIRA aims to improve K8s integration tests for the following recent 
activity.
- Java 17
- Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
- New K8s model (K8s clients)
- New custom schedulers and statefulset

  was:
This JIRA aims to improve K8s integration tests for the following recent 
activity.
- Java 17
- Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
- New K8s model (K8s clients)
- New custom schedulers


> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: William Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This JIRA aims to improve K8s integration tests for the following recent 
> activity.
> - Java 17
> - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
> - New K8s model (K8s clients)
> - New custom schedulers and statefulset



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33206:
--
Fix Version/s: 3.2.2

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the file size of the cached file on disk.
> We're running into OOMs with very small index files (~16 bytes on disk) but 
> the overhead of the ShuffleIndexInformation around this is much larger (e.g. 
> 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes 
> according to my tests. I'm not 100% sure what the correct number is and it'll 
> also depend on the architecture etc. so we can't be exact anyway.
> If we do that we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use up about 70-100 times as 
> much memory as we intend to. Our NodeManagers OOM with 4GB and more of 
> indexShuffleCache.
>  
>  
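
A minimal sketch of the kind of per-entry accounting being suggested, assuming a Guava-style cache keyed by file path; `CachedIndex`, its `sizeInBytes` field, and the 176-byte constant are hypothetical placeholders, not Spark's actual `ShuffleIndexInformation` bookkeeping:

{code:java}
import com.google.common.cache.{CacheBuilder, Weigher}

// Hypothetical stand-in for a cached index entry (not Spark's ShuffleIndexInformation).
final case class CachedIndex(sizeInBytes: Long)

// Assumed fixed per-entry JVM object overhead, per the 152-180 byte estimate above.
val entryOverheadBytes = 176L

val weigher = new Weigher[String, CachedIndex] {
  // Weigh each entry as its on-disk size plus the fixed overhead, so tiny index
  // files no longer look nearly free to the cache's memory budget.
  override def weigh(key: String, value: CachedIndex): Int =
    math.min(value.sizeInBytes + entryOverheadBytes, Int.MaxValue.toLong).toInt
}

val indexCache = CacheBuilder.newBuilder()
  .maximumWeight(100L * 1024 * 1024)    // example budget: 100 MiB
  .weigher[String, CachedIndex](weigher)
  .build[String, CachedIndex]()
{code}

With a weigher like this, the cache budget accounts for the retained size of each entry rather than only the tiny on-disk file size.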



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38272) Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38272:
--
Fix Version/s: 3.2.2

> Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode 
> and context name 
> ---
>
> Key: SPARK-38272
> URL: https://issues.apache.org/jira/browse/SPARK-38272
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38396.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: William Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.3.0
>
>
> This JIRA aims to improve K8s integration tests for the following recent 
> activity.
> - Java 17
> - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
> - New K8s model (K8s clients)
> - New custom schedulers



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38396:
-

Assignee: William Hyun

> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: William Hyun
>Priority: Major
>  Labels: releasenotes
>
> This JIRA aims to improve K8s integration tests for the following recent 
> activity.
> - Java 17
> - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
> - New K8s model (K8s clients)
> - New custom schedulers



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37354) Make the Java version installed on the container image used by the K8s integration tests with SBT configurable

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37354:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Bug)

> Make the Java version installed on the container image used by the K8s 
> integration tests with SBT configurable
> --
>
> Key: SPARK-37354
> URL: https://issues.apache.org/jira/browse/SPARK-37354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> I noticed that the default Java version installed on the container image used 
> by the K8s integration tests is different depending on how the tests are run.
> If the tests are launched by Maven, Java 8 is installed.
> On the other hand, if the tests are launched by SBT, the Java version is 11.
> Further, we have no way to change the version.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37319) Support K8s image building with Java 17

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37319:
--
Parent Issue: SPARK-38396  (was: SPARK-33772)

> Support K8s image building with Java 17
> ---
>
> Key: SPARK-37319
> URL: https://issues.apache.org/jira/browse/SPARK-37319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38302) Use Java 17 in K8S integration tests when setting spark-tgz

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38302:
--
Parent Issue: SPARK-38396  (was: SPARK-33772)

> Use Java 17 in K8S integration tests when setting spark-tgz
> ---
>
> Key: SPARK-38302
> URL: https://issues.apache.org/jira/browse/SPARK-38302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> When the `spark-tgz` parameter is set during integration tests, an error occurs because 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  cannot be found. This is due to the default value of 
> `spark.kubernetes.test.dockerFile` being 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`.
>  When using the tgz, the working directory is 
> `${spark.kubernetes.test.unpackSparkDir}`, and the relative path 
> `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17`
>  is invalid.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38269) Clean up redundant type cast

2022-03-02 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-38269.

Fix Version/s: 3.3.0
 Assignee: Yang Jie
   Resolution: Fixed

> Clean up redundant type cast
> 
>
> Key: SPARK-38269
> URL: https://issues.apache.org/jira/browse/SPARK-38269
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37529) Support K8s integration tests for Java 17

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37529:
--
Parent Issue: SPARK-38396  (was: SPARK-33772)

> Support K8s integration tests for Java 17
> -
>
> Key: SPARK-37529
> URL: https://issues.apache.org/jira/browse/SPARK-37529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
>
> Now that we can build a container image for Java 17, let's support K8s 
> integration tests for Java 17.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37645) Word spell error - "labeled" spells as "labled"

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37645:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Improvement)

> Word spell error - "labeled" spells as "labled"
> ---
>
> Key: SPARK-37645
> URL: https://issues.apache.org/jira/browse/SPARK-37645
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Assignee: qian
>Priority: Minor
> Fix For: 3.3.0
>
>
> Word spell error - "labeled" spells as "labled"



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37998) Use `rbac.authorization.k8s.io/v1` instead of `v1beta1`

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37998:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Use `rbac.authorization.k8s.io/v1` instead of `v1beta1`
> ---
>
> Key: SPARK-37998
> URL: https://issues.apache.org/jira/browse/SPARK-37998
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> *$ kubectl version*
> Client Version: version.Info\{Major:"1", Minor:"22", GitVersion:"v1.22.3", 
> GitCommit:"c92036820499fedefec0f847e2054d824aea6cd1", GitTreeState:"clean", 
> BuildDate:"2021-10-27T18:41:28Z", GoVersion:"go1.16.9", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info\{Major:"1", Minor:"23", GitVersion:"v1.23.1", 
> GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", 
> BuildDate:"2021-12-16T11:34:54Z", GoVersion:"go1.17.5", Compiler:"gc", 
> Platform:"linux/amd64"}
> *$ k apply -f  
> resource-managers/kubernetes/integration-tests/dev/spark-rbac.yaml*
> namespace/spark created
> serviceaccount/spark-sa created
> *unable to recognize 
> "resource-managers/kubernetes/integration-tests/dev/spark-rbac.yaml": no 
> matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1beta1"*
> *unable to recognize 
> "resource-managers/kubernetes/integration-tests/dev/spark-rbac.yaml": no 
> matches for kind "ClusterRoleBinding" in version 
> "rbac.authorization.k8s.io/v1beta1"*



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38081) Support cloud-backend in K8s IT with SBT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38081:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Support cloud-backend in K8s IT with SBT
> 
>
> Key: SPARK-38081
> URL: https://issues.apache.org/jira/browse/SPARK-38081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38241) Close KubernetesClient in K8S integrations tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38241:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Task)

> Close KubernetesClient in K8S integrations tests
> 
>
> Key: SPARK-38241
> URL: https://issues.apache.org/jira/browse/SPARK-38241
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0
>
>
> The implementations of 
> org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend 
> should close their KubernetesClient instance in #cleanUp() method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36553) KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced

2022-03-02 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-36553.

Fix Version/s: 3.1.3
   3.3.0
   3.2.2
 Assignee: zhengruifeng
   Resolution: Fixed

> KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 
> was introduced
> 
>
> Key: SPARK-36553
> URL: https://issues.apache.org/jira/browse/SPARK-36553
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.1.1
>Reporter: Anders Rydbirk
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> We are running KMeans on approximately 350M rows of x, y, z coordinates using 
> the following configuration:
> {code:java}
> KMeans(
>   featuresCol='features',
>   predictionCol='centroid_id',
>   k=5,
>   initMode='k-means||',
>   initSteps=2,
>   tol=0.5,
>   maxIter=20,
>   seed=SEED,
>   distanceMeasure='euclidean'
> )
> {code}
> When using Spark 3.0.0 this worked fine, but  when upgrading to 3.1.1 we are 
> consistently getting errors unless we reduce K.
> Stacktrace:
>  
> {code:java}
> An error occurred while calling o167.fit.An error occurred while calling 
> o167.fit.: java.lang.NegativeArraySizeException: -897458648 at 
> scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:194) at 
> scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:191) at 
> scala.Array$.ofDim(Array.scala:221) at 
> org.apache.spark.mllib.clustering.DistanceMeasure.computeStatistics(DistanceMeasure.scala:52)
>  at 
> org.apache.spark.mllib.clustering.KMeans.runAlgorithmWithWeight(KMeans.scala:280)
>  at org.apache.spark.mllib.clustering.KMeans.runWithWeight(KMeans.scala:231) 
> at org.apache.spark.ml.clustering.KMeans.$anonfun$fit$1(KMeans.scala:354) at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213) at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:329) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown 
> Source) at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
> Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:282) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.base/java.lang.Thread.run(Unknown Source)
> {code}
>  
> The issue is introduced by 
> [#27758|#diff-725d4624ddf4db9cc51721c2ddaef50a1bc30e7b471e0439da28c5b5582efdfdR52]
>  which significantly reduces the maximum value of K. Snippet of the line that 
> throws the error, from [DistanceMeasure.scala|#L52]:
> {code:java}
> val packedValues = Array.ofDim[Double](k * (k + 1) / 2)
> {code}
>  
> *What we have tried:*
>  * Reducing iterations
>  * Reducing input volume
>  * Reducing K
> Only reducing K have yielded success.
>  
> *Possible workaround:*
>  # Roll back to Spark 3.0.0 since a KMeansModel generated with 3.0.0 cannot 
> be loaded in 3.1.1.
>  # Reduce K. Currently trying with 45000.
>  
> *What we don't understand*:
> Given the line of code above, we do not understand why we would get an 
> integer overflow.
> For K=50,000, packedValues should be allocated with the size of 1,250,025,000 
> < (2^31) and not result in a negative array size.
>  
> *Suggested resolution:*
> I'm not strong in the inner workings of KMeans, but my immediate thought 
> would be to add a fallback to previous logic for K larger than a set 
> threshold if the optimisation is to stay in place, as it breaks compatibility 
> from 3.0.0 to 3.1.1 for edge cases.
>  
> Please let me know if more information is needed, this is my first time 
> raising a bug for an OSS project.
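
One likely explanation for the "impossible" overflow: in `k * (k + 1) / 2` every operand is an `Int`, so the intermediate product `k * (k + 1)` is computed in 32-bit arithmetic before the division. For k = 50,000 that product is 2,500,050,000, which wraps to -1,794,917,296, and dividing by 2 gives -897,458,648 — exactly the value in the exception above. A minimal Scala sketch of the arithmetic (illustration only, not Spark source):

{code:java}
// Why Array.ofDim[Double](k * (k + 1) / 2) can go negative for k = 50000.
val k = 50000
val intSize  = k * (k + 1) / 2          // Int math: k*(k+1) wraps to -1,794,917,296; /2 = -897,458,648
val longSize = k.toLong * (k + 1) / 2   // Long math: 1,250,025,000, the size the reporter expected
println((intSize, longSize))            // (-897458648, 1250025000)
{code}

So the final size would indeed fit in an `Int`, but the unpromoted intermediate product does not; computing the size in `Long` (or falling back for large K) would avoid the negative allocation, though an array of 1.25 billion doubles is still roughly 10 GB of memory.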



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38272) Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode and context name

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38272:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Improvement)

> Use docker-desktop instead of docker-for-desktop for Docker K8S IT deployMode 
> and context name 
> ---
>
> Key: SPARK-38272
> URL: https://issues.apache.org/jira/browse/SPARK-38272
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38022) Use relativePath for K8s remote file test in BasicTestsSuite

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38022:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Use relativePath for K8s remote file test in BasicTestsSuite
> 
>
> Key: SPARK-38022
> URL: https://issues.apache.org/jira/browse/SPARK-38022
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> *BEFORE*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file *** FAILED *** (3 
> minutes, 3 seconds)
> [info]   The code passed to eventually never returned normally. Attempted 190 
> times over 3.01226506667 minutes. Last failure message: false was not 
> true. (KubernetesSuite.scala:452)
> ... {code}
> *AFTER*
> {code:java}
> $ build/sbt -Pkubernetes -Pkubernetes-integration-tests 
> -Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
>  -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
> ...
> [info] KubernetesSuite:
> ...
> [info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 608 
> milliseconds){code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Description: 
This JIRA aims to improve K8s integration tests for the following recent 
activity.
- Java 17
- Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
- New K8s model (K8s clients)
- New custom schedulers

  was:
This JIRA aims to improve K8s integration tests for the following recent 
activity.
- Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
- New K8s model (K8s clients)
- New custom schedulers


> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>
> This JIRA aims to improve K8s integration tests for the following recent 
> activity.
> - Java 17
> - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
> - New K8s model (K8s clients)
> - New custom schedulers



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Description: 
This JIRA aims to improve K8s integration tests for the following recent 
activity.
- Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
- New K8s model (K8s clients)
- New custom schedulers

> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>
> This JIRA aims to improve K8s integration tests for the following recent 
> activity.
> - Cloud K8s testing (including K8s 1.22 GA on EKS on March 15th)
> - New K8s model (K8s clients)
> - New custom schedulers



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38244:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Bug)

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Labels: releasenotes  (was: )

> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Target Version/s: 3.3.0

> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38396:
--
Affects Version/s: (was: 3.2.2)

> Improve K8s Integration Tests
> -
>
> Key: SPARK-38396
> URL: https://issues.apache.org/jira/browse/SPARK-38396
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38048) Add IntegrationTestBackend.describePods to support all K8s test backends

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38048:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Add IntegrationTestBackend.describePods to support all K8s test backends
> 
>
> Key: SPARK-38048
> URL: https://issues.apache.org/jira/browse/SPARK-38048
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38029) Support docker-desktop K8S integration test in SBT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38029:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Support docker-desktop K8S integration test in SBT
> --
>
> Key: SPARK-38029
> URL: https://issues.apache.org/jira/browse/SPARK-38029
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38392) Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38392:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Improvement)

> Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
> ---
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> Currently, when the user has not configured a K8S namespace, the 
> Kubernetes IT tests use UUID.randomUUID().toString() with the '-' characters removed:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of the "--name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38072) Support K8s imageTag parameter in SBT K8s IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38072:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Support K8s imageTag parameter in SBT K8s IT
> 
>
> Key: SPARK-38072
> URL: https://issues.apache.org/jira/browse/SPARK-38072
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38071) Support K8s namespace parameter in SBT K8s IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38071:
--
Parent: SPARK-38396
Issue Type: Sub-task  (was: Test)

> Support K8s namespace parameter in SBT K8s IT
> -
>
> Key: SPARK-38071
> URL: https://issues.apache.org/jira/browse/SPARK-38071
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38396) Improve K8s Integration Tests

2022-03-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-38396:
-

 Summary: Improve K8s Integration Tests
 Key: SPARK-38396
 URL: https://issues.apache.org/jira/browse/SPARK-38396
 Project: Spark
  Issue Type: Umbrella
  Components: Kubernetes, Tests
Affects Versions: 3.3.0, 3.2.2
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38048) Add IntegrationTestBackend.describePods to support all K8s test backends

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38048:
--
Fix Version/s: 3.2.2

> Add IntegrationTestBackend.describePods to support all K8s test backends
> 
>
> Key: SPARK-38048
> URL: https://issues.apache.org/jira/browse/SPARK-38048
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38379) Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes

2022-03-02 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500354#comment-17500354
 ] 

Thomas Graves commented on SPARK-38379:
---

just going by the stack trace this looks related to change 
https://issues.apache.org/jira/browse/SPARK-35182

[~dongjoon] Just curious if you have run into this?

> Kubernetes: NoSuchElementException: spark.app.id when using PersistentVolumes 
> --
>
> Key: SPARK-38379
> URL: https://issues.apache.org/jira/browse/SPARK-38379
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> I'm using Spark 3.2.1 on a kubernetes cluster and starting a spark-shell in 
> client mode.  I'm using persistent local volumes to mount nvme under /data in 
> the executors and on startup the driver always throws the warning below.
> using these options:
> --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=fast-disks
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
>  \
>      --conf 
> spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
>  
>  
> {code:java}
> 22/03/01 20:21:22 WARN ExecutorPodsSnapshotsStoreImpl: Exception when 
> notifying snapshot subscriber.
> java.util.NoSuchElementException: spark.app.id
>         at org.apache.spark.SparkConf.$anonfun$get$1(SparkConf.scala:245)
>         at scala.Option.getOrElse(Option.scala:189)
>         at org.apache.spark.SparkConf.get(SparkConf.scala:245)
>         at org.apache.spark.SparkConf.getAppId(SparkConf.scala:450)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.$anonfun$constructVolumes$4(MountVolumesFeatureStep.scala:88)
>         at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>         at scala.collection.Iterator.foreach(Iterator.scala:943)
>         at scala.collection.Iterator.foreach$(Iterator.scala:943)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>         at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>         at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>         at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>         at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>         at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>         at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.constructVolumes(MountVolumesFeatureStep.scala:57)
>         at 
> org.apache.spark.deploy.k8s.features.MountVolumesFeatureStep.configurePod(MountVolumesFeatureStep.scala:34)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.$anonfun$buildFromFeatures$4(KubernetesExecutorBuilder.scala:64)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>         at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>         at scala.collection.immutable.List.foldLeft(List.scala:91)
>         at 
> org.apache.spark.scheduler.cluster.k8s.KubernetesExecutorBuilder.buildFromFeatures(KubernetesExecutorBuilder.scala:63)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:391)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
>         at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>         at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
>         at 
> org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117)
>   

[jira] [Updated] (SPARK-38072) Support K8s imageTag parameter in SBT K8s IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38072:
--
Fix Version/s: 3.2.2

> Support K8s imageTag parameter in SBT K8s IT
> 
>
> Key: SPARK-38072
> URL: https://issues.apache.org/jira/browse/SPARK-38072
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38071) Support K8s namespace parameter in SBT K8s IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38071:
--
Fix Version/s: 3.2.2

> Support K8s namespace parameter in SBT K8s IT
> -
>
> Key: SPARK-38071
> URL: https://issues.apache.org/jira/browse/SPARK-38071
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38029) Support docker-desktop K8S integration test in SBT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38029:
--
Fix Version/s: 3.2.2

> Support docker-desktop K8S integration test in SBT
> --
>
> Key: SPARK-38029
> URL: https://issues.apache.org/jira/browse/SPARK-38029
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module

2022-03-02 Thread Yihong He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
--
Description: 
Create the ticket since instrumenting __enter__ and 
__exit__ magic methods for Pandas module can help improve accuracy 
of the usage data. Besides, we are interested in extending the Pandas usage 
logger to other PySpark modules in the future so it will help improve accuracy 
of usage data of other PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records the internal call 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

  was:
Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and 
{_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve accuracy 
of the usage data. Besides, we are interested in extending the Pandas usage 
logger to other PySpark modules in the future so it will help improve accuracy 
of usage data of other PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and 
> {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve 
> accuracy of the usage data. Besides, we are interested in extending the 
> Pandas usage logger to other PySpark modules in the future so it will help 
> improve accuracy of usage data of other PySpark modules.
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records the internal call 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since __enter__ and __exit__ methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.
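To make the motivation concrete, below is a minimal, self-contained sketch in plain Python (not the actual PySpark usage logger; the USAGE_LOG sink, the instrument_magic_methods decorator and the CachedFrame class are hypothetical stand-ins) showing how wrapping __enter__/__exit__ attributes a `with` block to the magic methods themselves rather than to whatever they call internally:

{code:python}
import functools

USAGE_LOG = []  # hypothetical stand-in for the real usage-logger sink


def _log_call(cls_name, func):
    # Record the qualified name of the wrapped method before delegating to it.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        USAGE_LOG.append(f"{cls_name}.{func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def instrument_magic_methods(cls):
    # Wrap __enter__/__exit__ so that entering/leaving a `with` block is logged
    # as the magic method itself, not as the internal calls it makes.
    for name in ("__enter__", "__exit__"):
        if name in cls.__dict__:
            setattr(cls, name, _log_call(cls.__name__, cls.__dict__[name]))
    return cls


@instrument_magic_methods
class CachedFrame:  # hypothetical analogue of pandas-on-Spark's CachedDataFrame
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._unpersist()  # internal call; without instrumentation only this would be logged
        return None

    def _unpersist(self):
        pass


with CachedFrame():
    pass

print(USAGE_LOG)  # ['CachedFrame.__enter__', 'CachedFrame.__exit__']
{code}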



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module

2022-03-02 Thread Yihong He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
--
Description: 
Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and 
{_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve accuracy 
of the usage data. Besides, we are interested in extending the Pandas usage 
logger to other PySpark modules in the future so it will help improve accuracy 
of usage data of other PySpark modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

  was:
Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic 
methods for Pandas module can help improve accuracy of the usage data. Besides, 
we are interested in extend the Pandas usage logger to other PySpark modules in 
the future so it will help improve accuracy of usage data of other PySpark 
modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> Create the ticket since instrumenting {_}{{_}}enter{{_}}{_} and 
> {_}{{_}}exit{{_}}{_} magic methods for Pandas module can help improve 
> accuracy of the usage data. Besides, we are interested in extending the 
> Pandas usage logger to other PySpark modules in the future so it will help 
> improve accuracy of usage data of other PySpark modules.
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module

2022-03-02 Thread Yihong He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
--
Description: 
Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic 
methods for Pandas module can help improve accuracy of the usage data. Besides, 
we are interested in extend the Pandas usage logger to other PySpark modules in 
the future so it will help improve accuracy of usage data of other PySpark 
modules.

For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

  was:
For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since _{_}enter{_}_ and _{_}exit{_}_ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

So instrumenting __enter__ and __exit__ magic methods for Pandas module can 
help improve accuracy of the usage data


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> Create the ticket since instrumenting _{_}enter{_}_ and _{_}exit{_}_ magic 
> methods for Pandas module can help improve accuracy of the usage data. 
> Besides, we are interested in extend the Pandas usage logger to other PySpark 
> modules in the future so it will help improve accuracy of usage data of other 
> PySpark modules.
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since {_}{{_}}enter{{_}}{_} and {_}{{_}}exit{{_}}{_} methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38392:
--
Component/s: Tests

> [K8S] Improve the names used for K8S namespace and driver pod in integration 
> tests
> --
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> Currently when there is no configured K8S namespace by the user the 
> Kubernetes IT tests use UUID.toString() without the '-'es:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of "–name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20050) Kafka 0.10 DirectStream doesn't commit last processed batch's offset when graceful shutdown

2022-03-02 Thread James Grinter (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500321#comment-17500321
 ] 

James Grinter commented on SPARK-20050:
---

You're just working around a genuine bug in behaviour that is documented as the 
way you're supposed to do it. Committing the offsets is also the better 
approach: monitoring tools can then see the consumer-group progress and detect 
that there's a problem, instead of it being hidden away in a private implementation.

> Kafka 0.10 DirectStream doesn't commit last processed batch's offset when 
> graceful shutdown
> ---
>
> Key: SPARK-20050
> URL: https://issues.apache.org/jira/browse/SPARK-20050
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sasaki Toru
>Priority: Major
>
> I use Kafka 0.10 DirectStream with properties 'enable.auto.commit=false' and 
> call 'DirectKafkaInputDStream#commitAsync' finally in each batches,  such 
> below
> {code}
> val kafkaStream = KafkaUtils.createDirectStream[String, String](...)
> kafkaStream.map { input =>
>   "key: " + input.key.toString + " value: " + input.value.toString + " 
> offset: " + input.offset.toString
>   }.foreachRDD { rdd =>
> rdd.foreach { input =>
> println(input)
>   }
> }
> kafkaStream.foreachRDD { rdd =>
>   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>   kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> {code}
> Some records which processed in the last batch before Streaming graceful 
> shutdown reprocess in the first batch after Spark Streaming restart, such 
> below
> * output first run of this application
> {code}
> key: null value: 1 offset: 101452472
> key: null value: 2 offset: 101452473
> key: null value: 3 offset: 101452474
> key: null value: 4 offset: 101452475
> key: null value: 5 offset: 101452476
> key: null value: 6 offset: 101452477
> key: null value: 7 offset: 101452478
> key: null value: 8 offset: 101452479
> key: null value: 9 offset: 101452480  // this is a last record before 
> shutdown Spark Streaming gracefully
> {code}
> * output re-run of this application
> {code}
> key: null value: 7 offset: 101452478   // duplication
> key: null value: 8 offset: 101452479   // duplication
> key: null value: 9 offset: 101452480   // duplication
> key: null value: 10 offset: 101452481
> {code}
> It may cause offsets specified in commitAsync will commit in the head of next 
> batch.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module

2022-03-02 Thread Yihong He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
--
Description: 
For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

 

 

> Instrument __enter__ and __exit__ magic methods for Pandas module
> -
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since __enter__ and __exit__ methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38392) Add spark- prefix to namespaces and -driver suffix to drivers during IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38392:
--
Summary: Add spark- prefix to namespaces and -driver suffix to drivers 
during IT  (was: [K8S] Improve the names used for K8S namespace and driver pod 
in integration tests)

> Add spark- prefix to namespaces and -driver suffix to drivers during IT
> ---
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> Currently when there is no configured K8S namespace by the user the 
> Kubernetes IT tests use UUID.toString() without the '-'es:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of "–name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38353) Instrument __enter__ and __exit__ magic methods for Pandas module

2022-03-02 Thread Yihong He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He updated SPARK-38353:
--
Description: 
For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since _{_}enter{_}_ and _{_}exit{_}_ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

So instrumenting __enter__ and __exit__ magic methods for Pandas module can 
help improve accuracy of the usage data

  was:
For example, for the following code:

 
{code:java}
pdf = pd.DataFrame(
[(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
)
psdf = ps.from_pandas(pdf)

with psdf.spark.cache() as cached_df:
self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
self.assert_eq(
repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
False, True))
){code}
 

pandas usage logger records 
[self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
 since __enter__ and __exit__ methods of 
[CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
 are not instrumented.

 

 


> Instrument __enter__ and __exit__ magic methods for Pandas module
> -
>
> Key: SPARK-38353
> URL: https://issues.apache.org/jira/browse/SPARK-38353
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Yihong He
>Priority: Minor
>
> For example, for the following code:
>  
> {code:java}
> pdf = pd.DataFrame(
> [(0.2, 0.3), (0.0, 0.6), (0.6, 0.0), (0.2, 0.1)], columns=["dogs", "cats"]
> )
> psdf = ps.from_pandas(pdf)
> with psdf.spark.cache() as cached_df:
> self.assert_eq(isinstance(cached_df, CachedDataFrame), True)
> self.assert_eq(
> repr(cached_df.spark.storage_level), repr(StorageLevel(True, True, 
> False, True))
> ){code}
>  
> pandas usage logger records 
> [self.spark.unpersist()|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12518]
>  since _{_}enter{_}_ and _{_}exit{_}_ methods of 
> [CachedDataFrame|https://github.com/apache/spark/blob/master/python/pyspark/pandas/frame.py#L12492]
>  are not instrumented.
> So instrumenting __enter__ and __exit__ magic methods for Pandas module can 
> help improve accuracy of the usage data



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38392) Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38392:
--
Summary: Add `spark-` prefix to namespaces and `-driver` suffix to drivers 
during IT  (was: Add spark- prefix to namespaces and -driver suffix to drivers 
during IT)

> Add `spark-` prefix to namespaces and `-driver` suffix to drivers during IT
> ---
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> Currently when there is no configured K8S namespace by the user the 
> Kubernetes IT tests use UUID.toString() without the '-'es:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of "–name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38392:
-

Assignee: Martin Tzvetanov Grigorov

> [K8S] Improve the names used for K8S namespace and driver pod in integration 
> tests
> --
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
>
> Currently when there is no configured K8S namespace by the user the 
> Kubernetes IT tests use UUID.toString() without the '-'es:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of "–name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests

2022-03-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38392.
---
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35711
[https://github.com/apache/spark/pull/35711]

> [K8S] Improve the names used for K8S namespace and driver pod in integration 
> tests
> --
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> Currently when there is no configured K8S namespace by the user the 
> Kubernetes IT tests use UUID.toString() without the '-'es:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of "–name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38395) Pyspark issue in resolving column when there is dot (.)

2022-03-02 Thread Krishna Sangeeth KS (Jira)
Krishna Sangeeth KS created SPARK-38395:
---

 Summary: Pyspark issue in resolving column when there is dot (.)
 Key: SPARK-38395
 URL: https://issues.apache.org/jira/browse/SPARK-38395
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
 Environment: Issue found in Mac OS Catalina, Pyspark 3.0
Reporter: Krishna Sangeeth KS


PySpark's applyInPandas has some difficulty resolving columns when there is a 
dot (.) in the column name.

Here is an example that reproduces the issue, created by modifying the doctest example 
[here|https://github.com/apache/spark/blob/branch-3.0/python/pyspark/sql/pandas/group_ops.py#L237-L248]
{code:python}
df1 = spark.createDataFrame(
[(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 
4.0)],
("abc|database|10.159.154|xef", "id", "v1"))
df2 = spark.createDataFrame(
[(2101, 1, "x"), (2101, 2, "y")],
("abc|database|10.159.154|xef", "id", "v2"))
def asof_join(l, r):
return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", by="id")
df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 double, v2 
string").show(){code}
This gives the below error
{code:python}
AnalysisException Traceback (most recent call last)
 in 
  8 return pd.merge_asof(l, r, on="abc|database|10.159.154|xef", 
by="id")
  9 df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
---> 10 asof_join, schema="`abc|database|10.159.154|xef` int, id int, v1 
double, v2 string").show()

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in applyInPandas(self, func, schema)
295 udf = pandas_udf(
296 func, returnType=schema, 
functionType=PythonEvalType.SQL_COGROUPED_MAP_PANDAS_UDF)
--> 297 all_cols = self._extract_cols(self._gd1) + 
self._extract_cols(self._gd2)
298 udf_column = udf(*all_cols)
299 jdf = self._gd1._jgd.flatMapCoGroupsInPandas(self._gd2._jgd, 
udf_column._jc.expr())

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in _extract_cols(gd)
303 def _extract_cols(gd):
304 df = gd._df
--> 305 return [df[col] for col in df.columns]
306 
307 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/pandas/group_ops.py
 in (.0)
303 def _extract_cols(gd):
304 df = gd._df
--> 305 return [df[col] for col in df.columns]
306 
307 

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/dataframe.py in 
__getitem__(self, item)
   1378 """
   1379 if isinstance(item, basestring):
-> 1380 jc = self._jdf.apply(item)
   1381 return Column(jc)
   1382 elif isinstance(item, Column):

~/anaconda3/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py in 
__call__(self, *args)
   1303 answer = self.gateway_client.send_command(command)
   1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307 for temp_arg in temp_args:

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in 
deco(*a, **kw)
135 # Hide where the exception came from that shows a 
non-Pythonic
136 # JVM exception message.
--> 137 raise_from(converted)
138 else:
139 raise

~/anaconda3/envs/py37/lib/python3.7/site-packages/pyspark/sql/utils.py in 
raise_from(e)

AnalysisException: Cannot resolve column name "abc|database|10.159.154|xef" 
among (abc|database|10.159.154|xef, id, v1); did you mean to quote the 
`abc|database|10.159.154|xef` column?;

{code}
As we can see, the column is present in the `among` list of the error message.

When I replace `.` (dot) with `_` (underscore), the code actually works.
{code:python}
df1 = spark.createDataFrame(
[(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 
4.0)],
("abc|database|10_159_154|xef", "id", "v1"))
df2 = spark.createDataFrame(
[(2101, 1, "x"), (2101, 2, "y")],
("abc|database|10_159_154|xef", "id", "v2"))
def asof_join(l, r):
return pd.merge_asof(l, r, on="abc|database|10_159_154|xef", by="id")
df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
asof_join, schema="`abc|database|10_159_154|xef` int, id int, v1 double, v2 
string").show()
{code}
{code:java}
+---------------------------+---+---+---+
|abc|database|10_159_154|xef| id| v1| v2|
+---------------------------+---+---+---+
|                       2101|  1|1.0|  x|
|                       2102|  1|3.0|  x|
|                       2101|  2|2.0|  y|
|                       2102|  2|4.0|  y|
+---------------------------+---+---+---+
{code}
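Until the underlying resolution issue is addressed, one possible workaround (a sketch only, assuming it is acceptable to rename the columns for the duration of the job) is to strip the dots from the column names before grouping and restore them afterwards:

{code:python}
# Workaround sketch (not an official fix): rename dotted columns with toDF()
# before cogroup/applyInPandas, then rename back afterwards.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orig = "abc|database|10.159.154|xef"
safe = orig.replace(".", "_")

df1 = spark.createDataFrame(
    [(2101, 1, 1.0), (2101, 2, 2.0), (2102, 1, 3.0), (2102, 2, 4.0)],
    (orig, "id", "v1")).toDF(safe, "id", "v1")
df2 = spark.createDataFrame(
    [(2101, 1, "x"), (2101, 2, "y")],
    (orig, "id", "v2")).toDF(safe, "id", "v2")

def asof_join(l, r):
    return pd.merge_asof(l, r, on=safe, by="id")

result = (df1.groupby("id").cogroup(df2.groupby("id"))
          .applyInPandas(asof_join,
                         schema=f"`{safe}` int, id int, v1 double, v2 string")
          .withColumnRenamed(safe, orig))  # restore the original dotted name
result.show()
{code}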



--
This messag

[jira] [Updated] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2022-03-02 Thread Jason Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Xu updated SPARK-38388:
-
Description: 
Spark repartition uses RoundRobinPartitioning; the generated result is 
non-deterministic when the data has some randomness and stage/task retries happen.

The bug can be triggered when upstream data has some randomness, a repartition 
is called on them, then followed by result stage (could be more stages).
As the pattern shows below:
upstream stage (data with randomness) -> (repartition shuffle) -> result stage

When one executor goes down at result stage, some tasks of that stage might 
have finished, others would fail, shuffle files on that executor also get lost, 
some tasks from previous stage (upstream data generation, repartition) will 
need to rerun to generate dependent shuffle data files.
Because data has some randomness, regenerated data in upstream retried tasks is 
slightly different, repartition then generates inconsistent ordering, then 
tasks at result stage will be retried generating different data.

This is similar but different to 
https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local 
sort to make the row ordering deterministic, the sorting algorithm it uses 
simply compares row/record binaries. But in this case, upstream data has some 
randomness, the sorting algorithm doesn't help keep the order, thus 
RoundRobinPartitioning introduced non-deterministic result.

The following code returns 986415, instead of 1000000:
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext

case class TestObject(id: Long, value: Double)

val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
$"id").withColumn("val", rand()).repartition(100).map { 
  row => if (TaskContext.get.stageAttemptNumber == 0 && 
TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
throw new Exception("pkill -f java".!!)
  }
  TestObject(row.getLong(0), row.getDouble(1))
}

ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")

spark.sql("select count(distinct id) from tmp.test_table").show{code}
Command: 
{code:java}
spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
--conf spark.shuffle.service.enabled=false){code}
To simulate the issue, disable external shuffle service is needed (if it's also 
enabled by default in your environment),  this is to trigger shuffle file loss 
and previous stage retries.
In our production, we have external shuffle service enabled, this data 
correctness issue happened when there were node losses.

Although there's some non-deterministic factor in the upstream data, users wouldn't 
expect to see an incorrect result.
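One mitigation for this particular pattern (a sketch only, not a fix for the RoundRobinPartitioning behaviour itself; it assumes a reliable checkpoint directory is available) is to checkpoint the randomized data before the round-robin repartition, so that stage retries re-read the materialized rows instead of regenerating different random values, for example in PySpark:

{code:python}
# Mitigation sketch in PySpark (the repro above is Scala): materialize the
# non-deterministic rows once so retried tasks re-read them instead of
# recomputing rand() with different results.
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed path

randomized = (spark.range(0, 1000 * 1000, 1)
              .repartition(100, "id")
              .withColumn("val", rand())
              .checkpoint())  # cuts lineage; retries read the checkpointed rows

randomized.repartition(100) \
    .write.mode("overwrite").saveAsTable("tmp.test_table")  # table name from the repro
{code}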

  was:
Spark repartition uses RoundRobinPartitioning, the generated results is 
non-deterministic when data has some randomness and stage/task retries happen.

The bug can be triggered when upstream data has some randomness, a repartition 
is called on them, then followed by result stage (could be more stages).
As the pattern shows below:
upstream stage (data with randomness) -> (repartition shuffle) -> result stage

When one executor goes down at result stage, some tasks of that stage might 
have finished, others would fail, shuffle files on that executor also get lost, 
some tasks from previous stage (upstream data generation, repartition) will 
need to rerun to generate dependent shuffle data files.
Because data has some randomness, regenerated data in upstream retried tasks is 
slightly different, repartition then generates inconsistent ordering, then 
tasks at result stage will be retried generating different data.

This is similar but different to 
https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local 
sort to make the row ordering deterministic, the sorting algorithm it uses 
simply compares row/record binaries. But in this case, upstream data has some 
randomness, the sorting algorithm doesn't help keep the order, thus 
RoundRobinPartitioning introduced non-deterministic result.

The following code returns 998818, instead of 1000000:
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext

case class TestObject(id: Long, value: Double)

val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
$"id").withColumn("val", rand()).repartition(100).map { 
  row => if (TaskContext.get.stageAttemptNumber == 0 && 
TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
throw new Exception("pkill -f java".!!)
  }
  TestObject(row.getLong(0), row.getDouble(1))
}

ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")

spark.sql("select count(distinct id) from tmp.test_table").show{code}
Command: 
{code:java}
spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
--conf spark.shuffle.service.enabled=false){code}
To simulate the issue, disable external shuffle service 

[jira] [Updated] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2022-03-02 Thread Jason Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Xu updated SPARK-38388:
-
Description: 
Spark repartition uses RoundRobinPartitioning, the generated results is 
non-deterministic when data has some randomness and stage/task retries happen.

The bug can be triggered when upstream data has some randomness, a repartition 
is called on them, then followed by result stage (could be more stages).
As the pattern shows below:
upstream stage (data with randomness) -> (repartition shuffle) -> result stage

When one executor goes down at result stage, some tasks of that stage might 
have finished, others would fail, shuffle files on that executor also get lost, 
some tasks from previous stage (upstream data generation, repartition) will 
need to rerun to generate dependent shuffle data files.
Because data has some randomness, regenerated data in upstream retried tasks is 
slightly different, repartition then generates inconsistent ordering, then 
tasks at result stage will be retried generating different data.

This is similar but different to 
https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local 
sort to make the row ordering deterministic, the sorting algorithm it uses 
simply compares row/record binaries. But in this case, upstream data has some 
randomness, the sorting algorithm doesn't help keep the order, thus 
RoundRobinPartitioning introduced non-deterministic result.

The following code returns 998818, instead of 1000000:
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext

case class TestObject(id: Long, value: Double)

val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
$"id").withColumn("val", rand()).repartition(100).map { 
  row => if (TaskContext.get.stageAttemptNumber == 0 && 
TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
throw new Exception("pkill -f java".!!)
  }
  TestObject(row.getLong(0), row.getDouble(1))
}

ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")

spark.sql("select count(distinct id) from tmp.test_table").show{code}
Command: 
{code:java}
spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
--conf spark.shuffle.service.enabled=false){code}
To simulate the issue, disable external shuffle service is needed (if it's also 
enabled by default in your environment),  this is to trigger shuffle file loss 
and previous stage retries.
In our production, we have external shuffle service enabled, this data 
correctness issue happened when there were node losses.

Although there's some non-deterministic factor in upstream data, user wouldn't 
expect  to see incorrect result.

  was:
Spark repartition uses RoundRobinPartitioning, the generated results is 
non-deterministic when data has some randomness and stage/task retries happen.

The bug can be triggered when upstream data has some randomness, a repartition 
is called on them, then followed by result stage (could be more stages).
As the pattern shows below:
upstream stage (data with randomness) -> (repartition shuffle) -> result stage

When one executor goes down at result stage, some tasks of that stage might 
have finished, others would fail, shuffle files on that executor also get lost, 
some tasks from previous stage (upstream data generation, repartition) will 
need to rerun to generate dependent shuffle data files.
Because data has some randomness, regenerated data in upstream retried tasks is 
slightly different, repartition then generates inconsistent ordering, then 
tasks at result stage will be retried generating different data.

This is similar but different to 
https://issues.apache.org/jira/browse/SPARK-23207, fix for it uses extra local 
sort to make the row ordering deterministic, the sorting algorithm it uses 
simply compares row/record binaries. But in this case, upstream data has some 
randomness, the sorting algorithm doesn't help keep the order, thus 
RoundRobinPartitioning introduced non-deterministic result.

The following code returns 998818, instead of 1000000:
{code:java}
import scala.sys.process._
import org.apache.spark.TaskContext

case class TestObject(id: Long, value: Double)

val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
$"id").withColumn("val", rand()).repartition(100).map { 
  row => if (TaskContext.get.stageAttemptNumber == 0 && 
TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 98) {
throw new Exception("pkill -f java".!!)
  }
  TestObject(row.getLong(0), row.getDouble(1))
}

ds.toDF("id", "value").write.saveAsTable("tmp.test_table")

spark.sql("select count(distinct id) from tmp.test_table").show{code}
Command: 
{code:java}
spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
--conf spark.shuffle.service.enabled=false){code}
To simulate the issue, disable external shuffle service is needed (if it's

[jira] [Resolved] (SPARK-38389) Add the `DATEDIFF` and `DATE_DIFF` aliases for `TIMESTAMPDIFF()`

2022-03-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38389.
--
Resolution: Fixed

Issue resolved by pull request 35709
[https://github.com/apache/spark/pull/35709]

> Add the `DATEDIFF` and `DATE_DIFF` aliases for `TIMESTAMPDIFF()`
> 
>
> Key: SPARK-38389
> URL: https://issues.apache.org/jira/browse/SPARK-38389
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Introduce the datediff()/date_diff() function, which takes three arguments: a
> unit and two datetime expressions, i.e.,
> {code:sql}
> datediff(unit, startDatetime, endDatetime)
> {code}
> The function can be an alias to timestampdiff().
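For illustration, a usage sketch (assuming Spark 3.3.0+ with this alias in place and that it follows timestampdiff() semantics exactly; the expected output is shown as a comment):

{code:python}
# Usage sketch: the three-argument datediff() alias, assumed to behave exactly
# like timestampdiff(unit, startDatetime, endDatetime).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    SELECT datediff(HOUR,
                    TIMESTAMP '2022-03-01 00:00:00',
                    TIMESTAMP '2022-03-02 12:00:00') AS diff
""").show()
# +----+
# |diff|
# +----+
# |  36|
# +----+
{code}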



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500272#comment-17500272
 ] 

Apache Spark commented on SPARK-33206:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/35714

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the file size of the cached file on disk.
> We're running into OOMs with very small index files (byte size ~16 bytes) but 
> the overhead of the ShuffleIndexInformation around this is much larger (e.g. 
> 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes 
> according to my tests. I'm not 100% sure what the correct number is and it'll 
> also depend on the architecture etc. so we can't be exact anyway.
> If we do that we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use up about 70-100 times as 
> much memory as we intend to. Our NodeManagers OOM with 4GB and more of 
> indexShuffleCache.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500271#comment-17500271
 ] 

Apache Spark commented on SPARK-33206:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/35714

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the file size of the cached file on disk.
> We're running into OOMs with very small index files (byte size ~16 bytes) but 
> the overhead of the ShuffleIndexInformation around this is much larger (e.g. 
> 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes 
> according to my tests. I'm not 100% sure what the correct number is and it'll 
> also depend on the architecture etc. so we can't be exact anyway.
> If we do that we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use up about 70-100 times as 
> much memory as we intend to. Our NodeManagers OOM with 4GB and more of 
> indexShuffleCache.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38394) build of spark sql against hadoop-3.4.0-snapshot failing with bouncycastle classpath error

2022-03-02 Thread Steve Loughran (Jira)
Steve Loughran created SPARK-38394:
--

 Summary: build of spark sql against hadoop-3.4.0-snapshot failing 
with bouncycastle classpath error
 Key: SPARK-38394
 URL: https://issues.apache.org/jira/browse/SPARK-38394
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.0
Reporter: Steve Loughran


Building Spark master with {{-Dhadoop.version=3.4.0-SNAPSHOT}} and a local 
Hadoop build breaks in the sbt compiler plugin:


{code}
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile 
(scala-test-compile-first) on project spark-sql_2.12: Execution 
scala-test-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile failed: A required 
class was missing while executing 
net.alchim31.maven:scala-maven-plugin:4.3.0:testCompile: 
org/bouncycastle/jce/provider/BouncyCastleProvider
[ERROR] -
[ERROR] realm =plugin>net.alchim31.maven:scala-maven-plugin:4.3.0

{code}

* this is the classpath of the sbt compiler
* hadoop hasn't been doing anything related to bouncy castle.

Setting scala-maven-plugin to 3.4.0 makes this go away, i.e. reapplying 
SPARK-36547.

The implication here is that the plugin version is going to have to be 
configured in different profiles.






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500207#comment-17500207
 ] 

Apache Spark commented on SPARK-38393:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35713

> Clean up deprecated usage of GenSeq/GenMap
> --
>
> Key: SPARK-38393
> URL: https://issues.apache.org/jira/browse/SPARK-38393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> GenSeq/GenMap  is identified as @deprecated since Scala 2.13.0 and Gen* 
> collection types have been removed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38393:


Assignee: Apache Spark

> Clean up deprecated usage of GenSeq/GenMap
> --
>
> Key: SPARK-38393
> URL: https://issues.apache.org/jira/browse/SPARK-38393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> GenSeq/GenMap  is identified as @deprecated since Scala 2.13.0 and Gen* 
> collection types have been removed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38393:


Assignee: (was: Apache Spark)

> Clean up deprecated usage of GenSeq/GenMap
> --
>
> Key: SPARK-38393
> URL: https://issues.apache.org/jira/browse/SPARK-38393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> GenSeq/GenMap  is identified as @deprecated since Scala 2.13.0 and Gen* 
> collection types have been removed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500206#comment-17500206
 ] 

Apache Spark commented on SPARK-38393:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35713

> Clean up deprecated usage of GenSeq/GenMap
> --
>
> Key: SPARK-38393
> URL: https://issues.apache.org/jira/browse/SPARK-38393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> GenSeq/GenMap  is identified as @deprecated since Scala 2.13.0 and Gen* 
> collection types have been removed.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-02 Thread Yang Jie (Jira)
Yang Jie created SPARK-38393:


 Summary: Clean up deprecated usage of GenSeq/GenMap
 Key: SPARK-38393
 URL: https://issues.apache.org/jira/browse/SPARK-38393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


GenSeq/GenMap  is identified as @deprecated since Scala 2.13.0 and Gen* 
collection types have been removed.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38342) Clean up deprecated api usage of Ivy

2022-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38342.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35672
[https://github.com/apache/spark/pull/35672]

> Clean up deprecated api usage of Ivy
> 
>
> Key: SPARK-38342
> URL: https://issues.apache.org/jira/browse/SPARK-38342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> {code:java}
> [WARNING] [Warn] 
> /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459:
>  [deprecation @ 
> org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | 
> origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy 
> is deprecated {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38342) Clean up deprecated api usage of Ivy

2022-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38342:


Assignee: Yang Jie

> Clean up deprecated api usage of Ivy
> 
>
> Key: SPARK-38342
> URL: https://issues.apache.org/jira/browse/SPARK-38342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> {code:java}
> [WARNING] [Warn] 
> /spark-source/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala:1459:
>  [deprecation @ 
> org.apache.spark.deploy.SparkSubmitUtils.resolveMavenCoordinates | 
> origin=org.apache.ivy.Ivy.retrieve | version=] method retrieve in class Ivy 
> is deprecated {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38392) [K8S] Improve the names used for K8S namespace and driver pod in integration tests

2022-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38392:


Assignee: (was: Apache Spark)

> [K8S] Improve the names used for K8S namespace and driver pod in integration 
> tests
> --
>
> Key: SPARK-38392
> URL: https://issues.apache.org/jira/browse/SPARK-38392
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> Currently when there is no configured K8S namespace by the user the 
> Kubernetes IT tests use UUID.toString() without the '-'es:
> {code:java}
> val namespace = 
> namespaceOption.getOrElse(UUID.randomUUID().toString.replaceAll("-", "")) 
> {code}
>  
> Proposal: prefix the temporary namespace with "spark-" or "spark-test-".
>  
> The name of the driver pod is:
> {code:java}
> driverPodName = "spark-test-app-" + 
> UUID.randomUUID().toString.replaceAll("-", "") {code}
> i.e. it does not mention that it is the driver.
> For non-test it uses names like `spark-on-k8s-app-f6bfc57f4a938abe-driver`, 
> i.e. the value of "–name" config + hash + "-driver".
> In both IT tests and non-test the executor pods always mention "-exec-", so 
> they are clear.
>  
> Proposal: unify non-test and IT test name and use "spark-test-app-"+ hash + 
> "-driver" for the IT tests.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


