[jira] [Updated] (SPARK-38857) test_mode test failed due to 1.4.1-1.4.3 bug

2022-04-12 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-38857:

Component/s: (was: Tests)

> test_mode test failed due to 1.4.1-1.4.3 bug
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().
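
To make the expected behaviour concrete, here is a minimal sketch (plain pandas only; it assumes the post-change behaviour on pandas 1.4.1 and later described in the linked issue, and is not taken from the actual patch):

{code:python}
import pandas as pd

pser = pd.Series([1, 1, 2, 3], name="x")

# On pandas 1.4.1+ the mode() result keeps the series name,
# so the pandas-on-Spark result should preserve it as well.
print(pser.mode().name)  # expected: 'x'
{code}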






[jira] [Updated] (SPARK-38857) series name should be preserved in pser.mode()

2022-04-12 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-38857:

Summary: series name should be preserved in pser.mode()  (was: test_mode 
test failed due to 1.4.1-1.4.3 bug)

> series name should be preserved in pser.mode()
> --
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Updated] (SPARK-38857) series name should be preserved in series.mode()

2022-04-12 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-38857:

Summary: series name should be preserved in series.mode()  (was: series 
name should be preserved in pser.mode())

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Updated] (SPARK-38857) test_mode test failed due to 1.4.1-1.4.3 bug

2022-04-12 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-38857:

Description: 
[https://github.com/pandas-dev/pandas/issues/46737]

We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.

 

update:

The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().

  was:
[https://github.com/pandas-dev/pandas/issues/46737]

We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's a 
issue.


> test_mode test failed due to 1.4.1-1.4.3 bug
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Updated] (SPARK-38857) test_mode test failed due to 1.4.1-1.4.3 bug

2022-04-12 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-38857:

Description: 
[https://github.com/pandas-dev/pandas/issues/46737]

We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.

 

update:

The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().

  was:
[https://github.com/pandas-dev/pandas/issues/46737]

We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's a 
issue.

 

update:

Pandas community confirm it's a unexpected but right changes, so series name 
should be preserved in pser.mode().


> test_mode test failed due to 1.4.1-1.4.3 bug
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-12 Thread panbingkun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521003#comment-17521003
 ] 

panbingkun commented on SPARK-38725:


I am working on this. Thanks [~maxgekk] 

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
> // Found duplicate keys '$key'
> new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = 
> Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class






[jira] [Created] (SPARK-38868) `assert_true` fails unconditionally after `left_outer` joins

2022-04-12 Thread Fabien Dubosson (Jira)
Fabien Dubosson created SPARK-38868:
---

 Summary: `assert_true` fails unconditionally after `left_outer` 
joins
 Key: SPARK-38868
 URL: https://issues.apache.org/jira/browse/SPARK-38868
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1
Reporter: Fabien Dubosson


When `assert_true` is used after a `left_outer` join, the assert exception is raised even though all the rows meet the condition. Using an `inner` join does not expose this issue.

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

session = SparkSession.builder.getOrCreate()

entries = session.createDataFrame(
    [
        ("a", 1),
        ("b", 2),
        ("c", 3),
    ],
    ["id", "outcome_id"],
)

outcomes = session.createDataFrame(
    [
        (1, 12),
        (2, 34),
        (3, 32),
    ],
    ["outcome_id", "outcome_value"],
)

# Inner join works as expected
(
    entries.join(outcomes, on="outcome_id", how="inner")
    .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
    .filter(sf.col("valid").isNull())
    .show()
)

# Left join fails with «'('outcome_value > 10)' is not true!» even though it is 
the case
(
    entries.join(outcomes, on="outcome_id", how="left_outer")
    .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10))
    .filter(sf.col("valid").isNull())
    .show()
){code}
Reproduced on `pyspark` versions `3.2.1`, `3.2.0`, `3.1.2`, and `3.1.1`. I am not sure whether "native" Spark exposes this issue as well; I don't have the knowledge or setup to test that.
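
For what it's worth, a small diagnostic sketch along the following lines (reusing the `entries` and `outcomes` frames from the snippet above) shows that no joined row actually violates the condition, so the failure really does come from `assert_true` itself:
{code:python}
# Count rows that violate the condition after the left_outer join.
joined = entries.join(outcomes, on="outcome_id", how="left_outer")
violations = joined.filter(~(sf.col("outcome_value") > 10)).count()
print(violations)  # 0 for the data above, yet assert_true still raises
{code}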






[jira] [Commented] (SPARK-38857) series name should be preserved in series.mode()

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521017#comment-17521017
 ] 

Apache Spark commented on SPARK-38857:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36159

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Assigned] (SPARK-38857) series name should be preserved in series.mode()

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38857:


Assignee: (was: Apache Spark)

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Assigned] (SPARK-38857) series name should be preserved in series.mode()

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38857:


Assignee: Apache Spark

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Commented] (SPARK-38857) series name should be preserved in series.mode()

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521018#comment-17521018
 ] 

Apache Spark commented on SPARK-38857:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36159

> series name should be preserved in series.mode()
> 
>
> Key: SPARK-38857
> URL: https://issues.apache.org/jira/browse/SPARK-38857
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/pandas-dev/pandas/issues/46737]
> We might want to skip this test on pandas 1.4.1-1.4.3 once the pandas community confirms it is an issue.
>  
> update:
> The pandas community confirmed this is an unexpected but correct change, so the series name should be preserved in pser.mode().






[jira] [Updated] (SPARK-38821) test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug

2022-04-12 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-38821:

Summary: test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug  (was: 
test_nsmallest test failed due to pandas 1.4.1/1.4.2 bug)

> test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug
> 
>
> Key: SPARK-38821
> URL: https://issues.apache.org/jira/browse/SPARK-38821
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Priority: Major
>
> [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/test_dataframe.py#L1829]
>  
> After [https://github.com/pandas-dev/pandas/issues/46589] is fixed, we need to skip L1829 on pandas v1.4.0 through v1.4.x.
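
For illustration, a minimal sketch of the kind of version-gated skip this implies (the test name and the exact pandas bounds are assumptions based on the issue title, not the actual patch):

{code:python}
from distutils.version import LooseVersion
import unittest

import pandas as pd

# Assumed bounds, per the issue title: the broken behaviour ships in
# pandas 1.4.0 through 1.4.2.
PANDAS_HAS_NSMALLEST_BUG = (
    LooseVersion("1.4.0") <= LooseVersion(pd.__version__) <= LooseVersion("1.4.2")
)


class DataFrameTest(unittest.TestCase):
    @unittest.skipIf(
        PANDAS_HAS_NSMALLEST_BUG,
        "Skipped on pandas 1.4.0-1.4.2 (pandas-dev/pandas#46589)",
    )
    def test_nsmallest(self):
        # The real assertions live in pyspark/pandas/tests/test_dataframe.py.
        pass
{code}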






[jira] [Commented] (SPARK-37935) Migrate onto error classes

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521024#comment-17521024
 ] 

Apache Spark commented on SPARK-37935:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36160

> Migrate onto error classes
> --
>
> Key: SPARK-37935
> URL: https://issues.apache.org/jira/browse/SPARK-37935
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/32850 introduced error classes as 
> a part of the error messages framework 
> (https://issues.apache.org/jira/browse/SPARK-33539). Need to migrate all 
> exceptions from QueryExecutionErrors, QueryCompilationErrors and 
> QueryParsingErrors onto the error classes using instances of SparkThrowable, 
> and carefully test every error class by writing tests in dedicated test 
> suites:
> * QueryExecutionErrorsSuite for the errors that occur during query execution
> * QueryCompilationErrorsSuite ... query compilation or eagerly executing 
> commands
> * QueryParsingErrorsSuite ... parsing errors
> Here is an example https://github.com/apache/spark/pull/35157 of how an existing Java exception can be replaced and how the related error classes are tested. At the end, we should migrate all exceptions from the files Query.*Errors.scala and cover all error classes from the error-classes.json file with tests.






[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521026#comment-17521026
 ] 

Apache Spark commented on SPARK-38725:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36160

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
> // Found duplicate keys '$key'
> new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = 
> Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521027#comment-17521027
 ] 

Apache Spark commented on SPARK-38725:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36160

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
> // Found duplicate keys '$key'
> new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = 
> Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class






[jira] [Assigned] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38725:


Assignee: Apache Spark

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
> // Found duplicate keys '$key'
> new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = 
> Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class






[jira] [Commented] (SPARK-37935) Migrate onto error classes

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521025#comment-17521025
 ] 

Apache Spark commented on SPARK-37935:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36160

> Migrate onto error classes
> --
>
> Key: SPARK-37935
> URL: https://issues.apache.org/jira/browse/SPARK-37935
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/32850 introduced error classes as 
> a part of the error messages framework 
> (https://issues.apache.org/jira/browse/SPARK-33539). Need to migrate all 
> exceptions from QueryExecutionErrors, QueryCompilationErrors and 
> QueryParsingErrors onto the error classes using instances of SparkThrowable, 
> and carefully test every error class by writing tests in dedicated test 
> suites:
> * QueryExecutionErrorsSuite for the errors that occur during query execution
> * QueryCompilationErrorsSuite ... query compilation or eagerly executing 
> commands
> * QueryParsingErrorsSuite ... parsing errors
> Here is an example https://github.com/apache/spark/pull/35157 of how an existing Java exception can be replaced and how the related error classes are tested. At the end, we should migrate all exceptions from the files Query.*Errors.scala and cover all error classes from the error-classes.json file with tests.






[jira] [Assigned] (SPARK-38725) Test the error class: DUPLICATE_KEY

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38725:


Assignee: (was: Apache Spark)

> Test the error class: DUPLICATE_KEY
> ---
>
> Key: SPARK-38725
> URL: https://issues.apache.org/jira/browse/SPARK-38725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *DUPLICATE_KEY* to 
> QueryParsingErrorsSuite. The test should cover the exception thrown in 
> QueryParsingErrors:
> {code:scala}
>   def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = {
> // Found duplicate keys '$key'
> new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = 
> Array(key), ctx)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class






[jira] [Created] (SPARK-38869) Respect Table capability `ACCEPT_ANY_SCHEMA` in default column resolution

2022-04-12 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38869:
--

 Summary: Respect Table capability `ACCEPT_ANY_SCHEMA` in default 
column resolution
 Key: SPARK-38869
 URL: https://issues.apache.org/jira/browse/SPARK-38869
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Daniel


If a V2 table has the capability of 
[ACCEPT_ANY_SCHEMA|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCapability.java#L94],
 we should skip adding default column values to the insert schema.

 






[jira] [Created] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-12 Thread Furcy Pin (Jira)
Furcy Pin created SPARK-38870:
-

 Summary: SparkSession.builder returns a new builder in Scala, but 
not in Python
 Key: SPARK-38870
 URL: https://issues.apache.org/jira/browse/SPARK-38870
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.2.1
Reporter: Furcy Pin


In pyspark, _SparkSession.builder_ always returns the same static builder, 
while the expected behaviour should be the same as in Scala, where it returns a 
new builder each time.


*How to reproduce*

When we run the following code in Scala :
{code:java}
import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
println("A : " + s1.conf.get("key")) // value
s1.conf.set("key", "new_value")
println("B : " + s1.conf.get("key")) // new_value

val s2 = SparkSession.builder.getOrCreate()
println("C : " + s1.conf.get("key")) // new_value{code}
The output is :
{code:java}
A : value
B : new_value
C : new_value   <<<{code}
 

 

But when we run the following (supposedly equivalent) code in Python:
{code:java}
from pyspark.sql import SparkSession

s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
print("A : " + s1.conf.get("key"))
s1.conf.set("key", "new_value")
print("B : " + s1.conf.get("key"))

s2 = SparkSession.builder.getOrCreate()
print("C : " + s1.conf.get("key")){code}
The output is : 

 
{code:java}
A : value
B : new_value
C : value  <<<
{code}
 

 

 

*Root cause analysis*



This comes from the fact that _SparkSession.builder_ behaves differently in 
Python than in Scala. In Scala, it returns a *new builder* each time, in Python 
it returns *the same builder* every time, and the SparkSession.Builder._options 
are static, too.


Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time and override the options that were set afterwards.
This leads to very awkward behavior in every Spark version up to and including 3.2.1.

{*}Example{*}:

This example crashes, but was fixed by SPARK-37638

 
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
"DYNAMIC").getOrCreate()

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" 
# OK

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # 
OK

from pyspark.sql import functions as f
from pyspark.sql.types import StringType
f.col("a").cast(StringType()) 

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
# This fails in all versions until the SPARK-37638 fix
# because before that fix, Column.cast() called 
SparkSession.builder.getOrCreate(){code}
 

 

But this example still crashes in the current version on the master branch
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
"DYNAMIC").getOrCreate()

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" 
# OK

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # 
OK

SparkSession.builder.getOrCreate() 

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
# This assert fails in master{code}
 

I will make a Pull Request to fix this bug shortly.
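
For illustration, here is a minimal, hypothetical sketch (not the actual patch) of the general shape of a fix: exposing `builder` through a class-level property so that each access returns a fresh builder, matching the Scala behaviour:
{code:python}
class Builder:
    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self


class classproperty:
    """Descriptor so `builder` is computed on every access to the class."""
    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, owner):
        return self.fget(owner)


class SharedBuilderSession:
    builder = Builder()            # shared builder: the behaviour described above


class FreshBuilderSession:
    @classproperty
    def builder(cls):              # sketched fix: a new builder per access
        return Builder()


a = SharedBuilderSession.builder.config("key", "value")
b = SharedBuilderSession.builder
assert a is b                      # same object, so old options leak through

c = FreshBuilderSession.builder.config("key", "value")
d = FreshBuilderSession.builder
assert c is not d                  # Scala-like: each access is independent
{code}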






[jira] [Updated] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-12 Thread Furcy Pin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Furcy Pin updated SPARK-38870:
--
Description: 
In pyspark, _SparkSession.builder_ always returns the same static builder, 
while the expected behaviour should be the same as in Scala, where it returns a 
new builder each time.

*How to reproduce*

When we run the following code in Scala :
{code:java}
import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
println("A : " + s1.conf.get("key")) // value
s1.conf.set("key", "new_value")
println("B : " + s1.conf.get("key")) // new_value

val s2 = SparkSession.builder.getOrCreate()
println("C : " + s1.conf.get("key")) // new_value{code}
The output is :
{code:java}
A : value
B : new_value
C : new_value   <<<{code}
 

But when we run the following (supposedly equivalent) code in Python:
{code:java}
from pyspark.sql import SparkSession

s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
print("A : " + s1.conf.get("key"))
s1.conf.set("key", "new_value")
print("B : " + s1.conf.get("key"))

s2 = SparkSession.builder.getOrCreate()
print("C : " + s1.conf.get("key")){code}
The output is : 
{code:java}
A : value
B : new_value
C : value  <<<
{code}
 

 

*Root cause analysis*

This comes from the fact that _SparkSession.builder_ behaves differently in 
Python than in Scala. In Scala, it returns a *new builder* each time, in Python 
it returns *the same builder* every time, and the SparkSession.Builder._options 
are static, too.

Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time and override the options that were set afterwards.
This leads to very awkward behavior in every Spark version up to and including 3.2.1.

{*}Example{*}:

This example crashes, but was fixed by SPARK-37638

 
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
"DYNAMIC").getOrCreate()

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" 
# OK

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # 
OK

from pyspark.sql import functions as f
from pyspark.sql.types import StringType
f.col("a").cast(StringType()) 

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
# This fails in all versions until the SPARK-37638 fix
# because before that fix, Column.cast() called 
SparkSession.builder.getOrCreate(){code}
 

But this example still crashes in the current version on the master branch
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
"DYNAMIC").getOrCreate()

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" 
# OK

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # 
OK

SparkSession.builder.getOrCreate() 

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
# This assert fails in master{code}
 

I will make a Pull Request to fix this bug shortly.

  was:
In pyspark, _SparkSession.builder_ always returns the same static builder, 
while the expected behaviour should be the same as in Scala, where it returns a 
new builder each time.


*How to reproduce*

When we run the following code in Scala :
{code:java}
import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
println("A : " + s1.conf.get("key")) // value
s1.conf.set("key", "new_value")
println("B : " + s1.conf.get("key")) // new_value

val s2 = SparkSession.builder.getOrCreate()
println("C : " + s1.conf.get("key")) // new_value{code}
The output is :
{code:java}
A : value
B : new_value
C : new_value   <<<{code}
 

 

But when we run the following (supposedly equivalent) code in Python:
{code:java}
from pyspark.sql import SparkSession

s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
print("A : " + s1.conf.get("key"))
s1.conf.set("key", "new_value")
print("B : " + s1.conf.get("key"))

s2 = SparkSession.builder.getOrCreate()
print("C : " + s1.conf.get("key")){code}
The output is : 

 
{code:java}
A : value
B : new_value
C : value  <<<
{code}
 

 

 

*Root cause analysis*



This comes from the fact that _SparkSession.builder_ behaves differently in 
Python than in Scala. In Scala, it returns a *new builder* each time, in Python 
it returns *the same builder* every time, and the SparkSession.Builder._options 
are static, too.


Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the 
options 

[jira] [Updated] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-12 Thread Furcy Pin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Furcy Pin updated SPARK-38870:
--
Description: 
In pyspark, _SparkSession.builder_ always returns the same static builder, 
while the expected behaviour should be the same as in Scala, where it returns a 
new builder each time.

*How to reproduce*

When we run the following code in Scala :
{code:java}
import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
println("A : " + s1.conf.get("key")) // value
s1.conf.set("key", "new_value")
println("B : " + s1.conf.get("key")) // new_value

val s2 = SparkSession.builder.getOrCreate()
println("C : " + s1.conf.get("key")) // new_value{code}
The output is :
{code:java}
A : value
B : new_value
C : new_value   <<<{code}
 

But when we run the following (supposedly equivalent) code in Python:
{code:java}
from pyspark.sql import SparkSession

s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
print("A : " + s1.conf.get("key"))
s1.conf.set("key", "new_value")
print("B : " + s1.conf.get("key"))

s2 = SparkSession.builder.getOrCreate()
print("C : " + s1.conf.get("key")){code}
The output is : 
{code:java}
A : value
B : new_value
C : value  <<<
{code}
 

 

*Root cause analysis*

This comes from the fact that _SparkSession.builder_ behaves differently in 
Python than in Scala. In Scala, it returns a *new builder* each time, in Python 
it returns *the same builder* every time, and the SparkSession.Builder._options 
are static, too.

Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time and override the options that were set afterwards.
This leads to very awkward behavior in every Spark version up to and including 3.2.1.

{*}Example{*}:

This example crashes, but was fixed by SPARK-37638

 
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
"DYNAMIC").getOrCreate()

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" 
# OK

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # 
OK

from pyspark.sql import functions as f
from pyspark.sql.types import StringType
f.col("a").cast(StringType()) 

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
# This fails in all versions until the SPARK-37638 fix
# because before that fix, Column.cast() called 
SparkSession.builder.getOrCreate(){code}
 

But this example still crashes in the current version on the master branch
{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
"DYNAMIC").getOrCreate()

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" 
# OK

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # 
OK

SparkSession.builder.getOrCreate() 

assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
# This assert fails in master{code}
 

I made a Pull Request to fix this bug : 
https://github.com/apache/spark/pull/36161

  was:
In pyspark, _SparkSession.builder_ always returns the same static builder, 
while the expected behaviour should be the same as in Scala, where it returns a 
new builder each time.

*How to reproduce*

When we run the following code in Scala :
{code:java}
import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
println("A : " + s1.conf.get("key")) // value
s1.conf.set("key", "new_value")
println("B : " + s1.conf.get("key")) // new_value

val s2 = SparkSession.builder.getOrCreate()
println("C : " + s1.conf.get("key")) // new_value{code}
The output is :
{code:java}
A : value
B : new_value
C : new_value   <<<{code}
 

But when we run the following (supposedly equivalent) code in Python:
{code:java}
from pyspark.sql import SparkSession

s1 = SparkSession.builder.master("local[2]").config("key", 
"value").getOrCreate()
print("A : " + s1.conf.get("key"))
s1.conf.set("key", "new_value")
print("B : " + s1.conf.get("key"))

s2 = SparkSession.builder.getOrCreate()
print("C : " + s1.conf.get("key")){code}
The output is : 
{code:java}
A : value
B : new_value
C : value  <<<
{code}
 

 

*Root cause analysis*

This comes from the fact that _SparkSession.builder_ behaves differently in 
Python than in Scala. In Scala, it returns a *new builder* each time, in Python 
it returns *the same builder* every time, and the SparkSession.Builder._options 
are static, too.

Because of this, whenever _SparkSession.builder.getOrCreate()_ is ca

[jira] [Commented] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521064#comment-17521064
 ] 

Apache Spark commented on SPARK-38870:
--

User 'FurcyPin' has created a pull request for this issue:
https://github.com/apache/spark/pull/36161

> SparkSession.builder returns a new builder in Scala, but not in Python
> --
>
> Key: SPARK-38870
> URL: https://issues.apache.org/jira/browse/SPARK-38870
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.1
>Reporter: Furcy Pin
>Priority: Major
>
> In pyspark, _SparkSession.builder_ always returns the same static builder, 
> while the expected behaviour should be the same as in Scala, where it returns 
> a new builder each time.
> *How to reproduce*
> When we run the following code in Scala :
> {code:java}
> import org.apache.spark.sql.SparkSession
> val s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> println("A : " + s1.conf.get("key")) // value
> s1.conf.set("key", "new_value")
> println("B : " + s1.conf.get("key")) // new_value
> val s2 = SparkSession.builder.getOrCreate()
> println("C : " + s1.conf.get("key")) // new_value{code}
> The output is :
> {code:java}
> A : value
> B : new_value
> C : new_value   <<<{code}
>  
> But when we run the following (supposedly equivalent) code in Python:
> {code:java}
> from pyspark.sql import SparkSession
> s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> print("A : " + s1.conf.get("key"))
> s1.conf.set("key", "new_value")
> print("B : " + s1.conf.get("key"))
> s2 = SparkSession.builder.getOrCreate()
> print("C : " + s1.conf.get("key")){code}
> The output is : 
> {code:java}
> A : value
> B : new_value
> C : value  <<<
> {code}
>  
>  
> *Root cause analysis*
> This comes from the fact that _SparkSession.builder_ behaves differently in 
> Python than in Scala. In Scala, it returns a *new builder* each time, in 
> Python it returns *the same builder* every time, and the 
> SparkSession.Builder._options are static, too.
> Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time and override the options that were set afterwards.
> This leads to very awkward behavior in every Spark version up to and including 3.2.1.
> {*}Example{*}:
> This example crashes, but was fixed by SPARK-37638
>  
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> from pyspark.sql import functions as f
> from pyspark.sql.types import StringType
> f.col("a").cast(StringType()) 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This fails in all versions until the SPARK-37638 fix
> # because before that fix, Column.cast() called 
> SparkSession.builder.getOrCreate(){code}
>  
> But this example still crashes in the current version on the master branch
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> SparkSession.builder.getOrCreate() 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This assert fails in master{code}
>  
> I made a Pull Request to fix this bug : 
> https://github.com/apache/spark/pull/36161






[jira] [Assigned] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38870:


Assignee: Apache Spark

> SparkSession.builder returns a new builder in Scala, but not in Python
> --
>
> Key: SPARK-38870
> URL: https://issues.apache.org/jira/browse/SPARK-38870
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.1
>Reporter: Furcy Pin
>Assignee: Apache Spark
>Priority: Major
>
> In pyspark, _SparkSession.builder_ always returns the same static builder, 
> while the expected behaviour should be the same as in Scala, where it returns 
> a new builder each time.
> *How to reproduce*
> When we run the following code in Scala :
> {code:java}
> import org.apache.spark.sql.SparkSession
> val s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> println("A : " + s1.conf.get("key")) // value
> s1.conf.set("key", "new_value")
> println("B : " + s1.conf.get("key")) // new_value
> val s2 = SparkSession.builder.getOrCreate()
> println("C : " + s1.conf.get("key")) // new_value{code}
> The output is :
> {code:java}
> A : value
> B : new_value
> C : new_value   <<<{code}
>  
> But when we run the following (supposedly equivalent) code in Python:
> {code:java}
> from pyspark.sql import SparkSession
> s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> print("A : " + s1.conf.get("key"))
> s1.conf.set("key", "new_value")
> print("B : " + s1.conf.get("key"))
> s2 = SparkSession.builder.getOrCreate()
> print("C : " + s1.conf.get("key")){code}
> The output is : 
> {code:java}
> A : value
> B : new_value
> C : value  <<<
> {code}
>  
>  
> *Root cause analysis*
> This comes from the fact that _SparkSession.builder_ behaves differently in 
> Python than in Scala. In Scala, it returns a *new builder* each time, in 
> Python it returns *the same builder* every time, and the 
> SparkSession.Builder._options are static, too.
> Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time and override the options that were set afterwards.
> This leads to very awkward behavior in every Spark version up to and including 3.2.1.
> {*}Example{*}:
> This example crashes, but was fixed by SPARK-37638
>  
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> from pyspark.sql import functions as f
> from pyspark.sql.types import StringType
> f.col("a").cast(StringType()) 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This fails in all versions until the SPARK-37638 fix
> # because before that fix, Column.cast() called 
> SparkSession.builder.getOrCreate(){code}
>  
> But this example still crashes in the current version on the master branch
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> SparkSession.builder.getOrCreate() 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This assert fails in master{code}
>  
> I made a Pull Request to fix this bug : 
> https://github.com/apache/spark/pull/36161






[jira] [Assigned] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38870:


Assignee: (was: Apache Spark)

> SparkSession.builder returns a new builder in Scala, but not in Python
> --
>
> Key: SPARK-38870
> URL: https://issues.apache.org/jira/browse/SPARK-38870
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.1
>Reporter: Furcy Pin
>Priority: Major
>
> In pyspark, _SparkSession.builder_ always returns the same static builder, 
> while the expected behaviour should be the same as in Scala, where it returns 
> a new builder each time.
> *How to reproduce*
> When we run the following code in Scala :
> {code:java}
> import org.apache.spark.sql.SparkSession
> val s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> println("A : " + s1.conf.get("key")) // value
> s1.conf.set("key", "new_value")
> println("B : " + s1.conf.get("key")) // new_value
> val s2 = SparkSession.builder.getOrCreate()
> println("C : " + s1.conf.get("key")) // new_value{code}
> The output is :
> {code:java}
> A : value
> B : new_value
> C : new_value   <<<{code}
>  
> But when we run the following (supposedly equivalent) code in Python:
> {code:java}
> from pyspark.sql import SparkSession
> s1 = SparkSession.builder.master("local[2]").config("key", 
> "value").getOrCreate()
> print("A : " + s1.conf.get("key"))
> s1.conf.set("key", "new_value")
> print("B : " + s1.conf.get("key"))
> s2 = SparkSession.builder.getOrCreate()
> print("C : " + s1.conf.get("key")){code}
> The output is : 
> {code:java}
> A : value
> B : new_value
> C : value  <<<
> {code}
>  
>  
> *Root cause analysis*
> This comes from the fact that _SparkSession.builder_ behaves differently in 
> Python than in Scala. In Scala, it returns a *new builder* each time, in 
> Python it returns *the same builder* every time, and the 
> SparkSession.Builder._options are static, too.
> Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time and override the options that were set afterwards.
> This leads to very awkward behavior in every Spark version up to and including 3.2.1.
> {*}Example{*}:
> This example crashes, but was fixed by SPARK-37638
>  
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> from pyspark.sql import functions as f
> from pyspark.sql.types import StringType
> f.col("a").cast(StringType()) 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This fails in all versions until the SPARK-37638 fix
> # because before that fix, Column.cast() called 
> SparkSession.builder.getOrCreate(){code}
>  
> But this example still crashes in the current version on the master branch
> {code:java}
> from pyspark.sql import SparkSession
> spark = 
> SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", 
> "DYNAMIC").getOrCreate()
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == 
> "DYNAMIC" # OK
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC")
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # OK
> SparkSession.builder.getOrCreate() 
> assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" 
> # This assert fails in master{code}
>  
> I made a Pull Request to fix this bug : 
> https://github.com/apache/spark/pull/36161






[jira] [Commented] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521084#comment-17521084
 ] 

Apache Spark commented on SPARK-32170:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/36162

>  Improve the speculation for the inefficient tasks by the task metrics.
> ---
>
> Key: SPARK-32170
> URL: https://issues.apache.org/jira/browse/SPARK-32170
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: weixiuli
>Priority: Major
>
> 1) Tasks are speculated when they meet certain conditions, whether or not they are inefficient; this can be a huge waste of cluster resources.
> 2) In production, a speculative copy launched for an already efficient task will eventually be killed, which is unnecessary and wastes cluster resources.
> 3) So we should first evaluate whether a task is inefficient based on the metrics of successful tasks, and only then decide whether to speculate it. Inefficient tasks get speculated and efficient ones do not, which is better for cluster resources.






[jira] [Resolved] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38854.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36145
[https://github.com/apache/spark/pull/36145]

> Improve the test coverage for pyspark/statcounter.py
> 
>
> Key: SPARK-38854
> URL: https://issues.apache.org/jira/browse/SPARK-38854
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of statcounter.py 






[jira] [Resolved] (SPARK-38589) New SQL function: try_avg

2022-04-12 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38589.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35896
[https://github.com/apache/spark/pull/35896]

> New SQL function: try_avg
> -
>
> Key: SPARK-38589
> URL: https://issues.apache.org/jira/browse/SPARK-38589
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.0
>
>







[jira] [Created] (SPARK-38873) CLONE - Improve the test coverage for pyspark/mllib module

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38873:


 Summary: CLONE - Improve the test coverage for pyspark/mllib module
 Key: SPARK-38873
 URL: https://issues.apache.org/jira/browse/SPARK-38873
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar


Currently, the mllib module has 88% test coverage.

We could improve it by adding the missing tests for the mllib module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38871:


 Summary: Improve the test coverage for PySpark/rddsampler.py
 Key: SPARK-38871
 URL: https://issues.apache.org/jira/browse/SPARK-38871
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark, Tests
Affects Versions: 3.3.0
Reporter: pralabhkumar


Currently, PySpark test coverage is around 91% according to the codecov report: 
[https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark]

Since about 9% is still untested, I think it would be great to improve our test 
coverage.

Of course we might not target 100%, but we should get as close as possible, to 
the level that we can currently cover with CI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38877) CLONE - Improve the test coverage for pyspark/find_spark_home.py

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38877:


 Summary: CLONE - Improve the test coverage for 
pyspark/find_spark_home.py
 Key: SPARK-38877
 URL: https://issues.apache.org/jira/browse/SPARK-38877
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: pralabhkumar
Assignee: Hyukjin Kwon
 Fix For: 3.4.0


We should test the case when the environment variables are not set 
(https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py); 
see the sketch below.
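One way such a test could look, as a hedged sketch using only the standard library; the exact environment variables cleared and the final assertion are assumptions, not the shape of the eventual test:

{code:python}
# Sketch of a test for the "environment variables not set" path.
# Assumes pyspark is importable; which variables to clear is an assumption.
import os
import unittest
from unittest import mock

from pyspark.find_spark_home import _find_spark_home

class FindSparkHomeTest(unittest.TestCase):
    def test_find_spark_home_without_env(self):
        # Drop SPARK_HOME so the fallback search logic is exercised.
        env = {k: v for k, v in os.environ.items() if k != "SPARK_HOME"}
        with mock.patch.dict(os.environ, env, clear=True):
            spark_home = _find_spark_home()
        # The function should still locate a usable Spark home directory.
        self.assertTrue(os.path.isdir(spark_home))

if __name__ == "__main__":
    unittest.main()
{code}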



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38875) CLONE - Improve the test coverage for pyspark/sql module

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38875:


 Summary: CLONE - Improve the test coverage for pyspark/sql module
 Key: SPARK-38875
 URL: https://issues.apache.org/jira/browse/SPARK-38875
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar


Currently, the sql module has 90% test coverage.

We could improve it by adding the missing tests for the sql module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38878) CLONE - Improve the test coverage for pyspark/statcounter.py

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38878:


 Summary: CLONE - Improve the test coverage for 
pyspark/statcounter.py
 Key: SPARK-38878
 URL: https://issues.apache.org/jira/browse/SPARK-38878
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar
Assignee: Hyukjin Kwon
 Fix For: 3.4.0


Improve the test coverage of statcounter.py 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38876) CLONE - Improve the test coverage for pyspark/*.py

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38876:


 Summary: CLONE - Improve the test coverage for pyspark/*.py
 Key: SPARK-38876
 URL: https://issues.apache.org/jira/browse/SPARK-38876
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar


Currently, there are several Python scripts directly under the pyspark/ 
directory (e.g. rdd.py, util.py, serializers.py, ...).

We could improve the test coverage by adding the missing tests for these 
scripts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38872) CLONE - Improve the test coverage for pyspark/pandas module

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38872:


 Summary: CLONE - Improve the test coverage for pyspark/pandas 
module
 Key: SPARK-38872
 URL: https://issues.apache.org/jira/browse/SPARK-38872
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar


Currently, the pandas module (pandas API on Spark) has 94% test coverage.

We could improve it by adding the missing tests for the pandas module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521134#comment-17521134
 ] 

pralabhkumar commented on SPARK-38871:
--

Please close this one; it was wrongly cloned.

> Improve the test coverage for PySpark/rddsampler.py
> ---
>
> Key: SPARK-38871
> URL: https://issues.apache.org/jira/browse/SPARK-38871
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, PySpark test coverage is around 91% according to the codecov report: 
> [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark]
> Since about 9% is still untested, I think it would be great to improve our 
> test coverage.
> Of course we might not target 100%, but we should get as close as possible, 
> to the level that we can currently cover with CI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38879:


 Summary: Improve the test coverage for pyspark/rddsampler.py
 Key: SPARK-38879
 URL: https://issues.apache.org/jira/browse/SPARK-38879
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar
Assignee: Hyukjin Kwon
 Fix For: 3.4.0


Improve the test coverage of statcounter.py 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521135#comment-17521135
 ] 

pralabhkumar edited comment on SPARK-38879 at 4/12/22 1:07 PM:
---

Please allow me to work on this 


was (Author: pralabhkumar):
I will be working on this . 

> Improve the test coverage for pyspark/rddsampler.py
> ---
>
> Key: SPARK-38879
> URL: https://issues.apache.org/jira/browse/SPARK-38879
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of rddsampler.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521135#comment-17521135
 ] 

pralabhkumar commented on SPARK-38879:
--

I will be working on this . 

> Improve the test coverage for pyspark/rddsampler.py
> ---
>
> Key: SPARK-38879
> URL: https://issues.apache.org/jira/browse/SPARK-38879
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of rddsampler.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar updated SPARK-38879:
-
Description: Improve the test coverage of rddsampler.py  (was: Improve the 
test coverage of statcounter.py )

> Improve the test coverage for pyspark/rddsampler.py
> ---
>
> Key: SPARK-38879
> URL: https://issues.apache.org/jira/browse/SPARK-38879
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of rddsampler.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar resolved SPARK-38871.
--
Resolution: Invalid

> Improve the test coverage for PySpark/rddsampler.py
> ---
>
> Key: SPARK-38871
> URL: https://issues.apache.org/jira/browse/SPARK-38871
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, PySpark test coverage is around 91% according to the codecov report: 
> [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark]
> Since about 9% is still untested, I think it would be great to improve our 
> test coverage.
> Of course we might not target 100%, but we should get as close as possible, 
> to the level that we can currently cover with CI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py

2022-04-12 Thread pralabhkumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pralabhkumar closed SPARK-38871.


This issue was wrongly created, hence closing it.

> Improve the test coverage for PySpark/rddsampler.py
> ---
>
> Key: SPARK-38871
> URL: https://issues.apache.org/jira/browse/SPARK-38871
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, Tests
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, PySpark test coverage is around 91% according to the codecov report: 
> [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark]
> Since about 9% is still untested, I think it would be great to improve our 
> test coverage.
> Of course we might not target 100%, but we should get as close as possible, 
> to the level that we can currently cover with CI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38874) CLONE - Improve the test coverage for pyspark/ml module

2022-04-12 Thread pralabhkumar (Jira)
pralabhkumar created SPARK-38874:


 Summary: CLONE - Improve the test coverage for pyspark/ml module
 Key: SPARK-38874
 URL: https://issues.apache.org/jira/browse/SPARK-38874
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: pralabhkumar


Currently, the ml module has 90% test coverage.

We could improve it by adding the missing tests for the ml module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38848) Replace all `@Test(expected = XXException)` with assertThrows

2022-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38848:


Assignee: Yang Jie

> Replace all `@Test(expected = XXException)` with assertThrows
> --
>
> Key: SPARK-38848
> URL: https://issues.apache.org/jira/browse/SPARK-38848
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> {{@Test}} no longer has the {{expected}} parameter in JUnit 5; use 
> {{assertThrows}} instead



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38848) Replace all `@Test(expected = XXException)` with assertThrows

2022-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38848.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36133
[https://github.com/apache/spark/pull/36133]

> Replace all `@Test(expected = XXException)` with assertThrows
> --
>
> Key: SPARK-38848
> URL: https://issues.apache.org/jira/browse/SPARK-38848
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> {{@Test}} no longer has the {{expected}} parameter in JUnit 5; use 
> {{assertThrows}} instead



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38847) Introduce a `viewToSeq` function for `KVUtils`

2022-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38847:


Assignee: Yang Jie

> Introduce a `viewToSeq` function for `KVUtils`
> --
>
> Key: SPARK-38847
> URL: https://issues.apache.org/jira/browse/SPARK-38847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> There are many places in Spark that convert a `KVStoreView` into a `List` 
> without closing the underlying `KVStoreIterator`. These resources are mainly 
> recycled by the `finalize()` method implemented in `LevelDB` and `RocksDB`, 
> which makes `KVStoreIterator` resource recycling unpredictable (see the 
> sketch below).
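The intended pattern, sketched conceptually in Python: Spark's actual `KVUtils.viewToSeq` is Scala, and the class and method names below are illustrative assumptions, not Spark's API. The point is that the iterator is closed deterministically rather than left to `finalize()`.

{code:python}
# Conceptual sketch only: materialize a view into a list while making sure the
# underlying iterator is always closed. Names are illustrative, not Spark's API.

class FakeKVStoreView:
    """Stands in for a KVStoreView whose iterator must be closed explicitly."""
    def __init__(self, items):
        self._items = items
        self.closed = False

    def closeable_iterator(self):
        view = self
        class _Iter:
            def __iter__(self):
                return iter(view._items)
            def close(self):
                view.closed = True
        return _Iter()

def view_to_seq(view):
    it = view.closeable_iterator()
    try:
        return list(it)   # materialize eagerly
    finally:
        it.close()        # always release the iterator's resources

view = FakeKVStoreView([1, 2, 3])
print(view_to_seq(view), view.closed)  # [1, 2, 3] True
{code}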



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38847) Introduce a `viewToSeq` function for `KVUtils`

2022-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38847.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36132
[https://github.com/apache/spark/pull/36132]

> Introduce a `viewToSeq` function for `KVUtils`
> --
>
> Key: SPARK-38847
> URL: https://issues.apache.org/jira/browse/SPARK-38847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> There are many codes in spark that convert KVStoreView into `List`, and these 
> codes will not close `KVStoreIterator`, these resources are mainly recycled 
> by `finalize()` method implemented in `LevelDB` and `RockSB`, this makes 
> `KVStoreIterator` resource recycling unpredictable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38689) Use error classes in the compilation errors of not allowed DESC PARTITION

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521227#comment-17521227
 ] 

Apache Spark commented on SPARK-38689:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/36163

> Use error classes in the compilation errors of not allowed DESC PARTITION
> -
>
> Key: SPARK-38689
> URL: https://issues.apache.org/jira/browse/SPARK-38689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * descPartitionNotAllowedOnTempView
> * descPartitionNotAllowedOnView
> * descPartitionNotAllowedOnViewError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38689) Use error classes in the compilation errors of not allowed DESC PARTITION

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38689:


Assignee: Apache Spark

> Use error classes in the compilation errors of not allowed DESC PARTITION
> -
>
> Key: SPARK-38689
> URL: https://issues.apache.org/jira/browse/SPARK-38689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * descPartitionNotAllowedOnTempView
> * descPartitionNotAllowedOnView
> * descPartitionNotAllowedOnViewError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38689) Use error classes in the compilation errors of not allowed DESC PARTITION

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38689:


Assignee: (was: Apache Spark)

> Use error classes in the compilation errors of not allowed DESC PARTITION
> -
>
> Key: SPARK-38689
> URL: https://issues.apache.org/jira/browse/SPARK-38689
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryCompilationErrors:
> * descPartitionNotAllowedOnTempView
> * descPartitionNotAllowedOnView
> * descPartitionNotAllowedOnViewError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryCompilationErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-04-12 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-38880:


 Summary: Implement `numeric_only` parameter of `GroupBy.max/min`
 Key: SPARK-38880
 URL: https://issues.apache.org/jira/browse/SPARK-38880
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Implement `numeric_only` parameter of `GroupBy.max/min`
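A hypothetical usage sketch of what the parameter would enable, assuming it mirrors pandas semantics; the `numeric_only` keyword is the proposed addition, not an existing pandas-on-Spark argument at the time of writing:

{code:python}
# Hypothetical usage, assuming GroupBy.max/min gain a pandas-style
# `numeric_only` keyword; requires a running SparkSession.
import pyspark.pandas as ps

psdf = ps.DataFrame({"k": ["a", "a", "b"], "x": [1, 2, 3], "s": ["p", "q", "r"]})

# With numeric_only=True, non-numeric columns such as "s" would be dropped
# from the result, matching pandas' behavior.
print(psdf.groupby("k").max(numeric_only=True))
{code}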



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38880:


Assignee: (was: Apache Spark)

> Implement `numeric_only` parameter of `GroupBy.max/min`
> ---
>
> Key: SPARK-38880
> URL: https://issues.apache.org/jira/browse/SPARK-38880
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `numeric_only` parameter of `GroupBy.max/min`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38880:


Assignee: Apache Spark

> Implement `numeric_only` parameter of `GroupBy.max/min`
> ---
>
> Key: SPARK-38880
> URL: https://issues.apache.org/jira/browse/SPARK-38880
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Implement `numeric_only` parameter of `GroupBy.max/min`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521244#comment-17521244
 ] 

Apache Spark commented on SPARK-38880:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36148

> Implement `numeric_only` parameter of `GroupBy.max/min`
> ---
>
> Key: SPARK-38880
> URL: https://issues.apache.org/jira/browse/SPARK-38880
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `numeric_only` parameter of `GroupBy.max/min`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521245#comment-17521245
 ] 

Apache Spark commented on SPARK-38880:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36148

> Implement `numeric_only` parameter of `GroupBy.max/min`
> ---
>
> Key: SPARK-38880
> URL: https://issues.apache.org/jira/browse/SPARK-38880
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Implement `numeric_only` parameter of `GroupBy.max/min`



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36620) Client side related push-based shuffle metrics

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521276#comment-17521276
 ] 

Apache Spark commented on SPARK-36620:
--

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/36165

> Client side related push-based shuffle metrics
> --
>
> Key: SPARK-36620
> URL: https://issues.apache.org/jira/browse/SPARK-36620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> Need to add client-side metrics to push-based shuffle.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36620) Client side related push-based shuffle metrics

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521277#comment-17521277
 ] 

Apache Spark commented on SPARK-36620:
--

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/36165

> Client side related push-based shuffle metrics
> --
>
> Key: SPARK-36620
> URL: https://issues.apache.org/jira/browse/SPARK-36620
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> Need to add client-side metrics to push-based shuffle.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-12 Thread Mark Khaitman (Jira)
Mark Khaitman created SPARK-38881:
-

 Summary: PySpark Kinesis Streaming should expose metricsLevel 
CloudWatch config that is already supported in the Scala/Java APIs
 Key: SPARK-38881
 URL: https://issues.apache.org/jira/browse/SPARK-38881
 Project: Spark
  Issue Type: Improvement
  Components: DStreams, Input/Output, PySpark
Affects Versions: 3.2.1
Reporter: Mark Khaitman


This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
merged as part of Spark 3.0.0

This change is desirable as it further exposes the metricsLevel config 
parameter that was added for the Scala/Java Spark APIs when working with the 
Kinesis Streaming integration, and makes it available to the PySpark API as 
well.

This change passes all tests, and local testing was done with a development 
Kinesis stream in AWS, in order to confirm that metrics were no longer being 
reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
Kinesis streaming context creation, and also worked as it does today when 
leaving the MetricsLevel parameter out, which would result in a default of 
DETAILED, with CloudWatch metrics appearing again.

I plan to open the PR from my forked repo shortly for further discussion if 
required.
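For illustration, the PySpark call might look roughly like this once the parameter is exposed. This is a hedged sketch: the `metricsLevel` keyword and the `MetricsLevel` import are the proposed additions described above, the rest mirrors the existing `KinesisUtils.createStream` usage, and the stream, endpoint, and region values are placeholders.

{code:python}
# Hedged sketch of the proposed PySpark API; `metricsLevel` / `MetricsLevel`
# are the additions this issue proposes. Stream name, endpoint and region are
# placeholders; assumes an existing SparkContext `sc` (e.g. the pyspark shell).
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import (
    KinesisUtils,
    InitialPositionInStream,
    MetricsLevel,  # proposed addition
)

ssc = StreamingContext(sc, batchDuration=10)

stream = KinesisUtils.createStream(
    ssc,
    kinesisAppName="my-app",
    streamName="my-stream",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
    metricsLevel=MetricsLevel.NONE,  # proposed: suppress CloudWatch metrics
)
{code}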



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38881:


Assignee: (was: Apache Spark)

> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests, and local testing was done with a development 
> Kinesis stream in AWS, in order to confirm that metrics were no longer being 
> reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
> Kinesis streaming context creation, and also worked as it does today when 
> leaving the MetricsLevel parameter out, which would result in a default of 
> DETAILED, with CloudWatch metrics appearing again.
> https://github.com/apache/spark/pull/36166
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38881:


Assignee: Apache Spark

> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Assignee: Apache Spark
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests, and local testing was done with a development 
> Kinesis stream in AWS, in order to confirm that metrics were no longer being 
> reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
> Kinesis streaming context creation, and also worked as it does today when 
> leaving the MetricsLevel parameter out, which would result in a default of 
> DETAILED, with CloudWatch metrics appearing again.
> https://github.com/apache/spark/pull/36166
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-12 Thread Mark Khaitman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Khaitman updated SPARK-38881:
--
Description: 
This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
merged as part of Spark 3.0.0

This change is desirable as it further exposes the metricsLevel config 
parameter that was added for the Scala/Java Spark APIs when working with the 
Kinesis Streaming integration, and makes it available to the PySpark API as 
well.

This change passes all tests, and local testing was done with a development 
Kinesis stream in AWS, in order to confirm that metrics were no longer being 
reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
Kinesis streaming context creation, and also worked as it does today when 
leaving the MetricsLevel parameter out, which would result in a default of 
DETAILED, with CloudWatch metrics appearing again.

https://github.com/apache/spark/pull/36166

 

  was:
This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
merged as part of Spark 3.0.0

This change is desirable as it further exposes the metricsLevel config 
parameter that was added for the Scala/Java Spark APIs when working with the 
Kinesis Streaming integration, and makes it available to the PySpark API as 
well.

This change passes all tests, and local testing was done with a development 
Kinesis stream in AWS, in order to confirm that metrics were no longer being 
reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
Kinesis streaming context creation, and also worked as it does today when 
leaving the MetricsLevel parameter out, which would result in a default of 
DETAILED, with CloudWatch metrics appearing again.

I plan to open the PR from my forked repo shortly for further discussion if 
required.


> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests, and local testing was done with a development 
> Kinesis stream in AWS, in order to confirm that metrics were no longer being 
> reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
> Kinesis streaming context creation, and also worked as it does today when 
> leaving the MetricsLevel parameter out, which would result in a default of 
> DETAILED, with CloudWatch metrics appearing again.
> https://github.com/apache/spark/pull/36166
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521293#comment-17521293
 ] 

Apache Spark commented on SPARK-38881:
--

User 'mkman84' has created a pull request for this issue:
https://github.com/apache/spark/pull/36166

> PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that 
> is already supported in the Scala/Java APIs
> ---
>
> Key: SPARK-38881
> URL: https://issues.apache.org/jira/browse/SPARK-38881
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Input/Output, PySpark
>Affects Versions: 3.2.1
>Reporter: Mark Khaitman
>Priority: Major
>
> This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was 
> merged as part of Spark 3.0.0
> This change is desirable as it further exposes the metricsLevel config 
> parameter that was added for the Scala/Java Spark APIs when working with the 
> Kinesis Streaming integration, and makes it available to the PySpark API as 
> well.
> This change passes all tests, and local testing was done with a development 
> Kinesis stream in AWS, in order to confirm that metrics were no longer being 
> reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark 
> Kinesis streaming context creation, and also worked as it does today when 
> leaving the MetricsLevel parameter out, which would result in a default of 
> DETAILED, with CloudWatch metrics appearing again.
> https://github.com/apache/spark/pull/36166
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options

2022-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38767:
-

Assignee: Yaohua Cui

> Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
> 
>
> Key: SPARK-38767
> URL: https://issues.apache.org/jira/browse/SPARK-38767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yaohua Zhao
>Assignee: Yaohua Cui
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options

2022-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38767.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36069
[https://github.com/apache/spark/pull/36069]

> Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
> 
>
> Key: SPARK-38767
> URL: https://issues.apache.org/jira/browse/SPARK-38767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yaohua Zhao
>Assignee: Yaohua Cui
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options

2022-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38767:
-

Assignee: Yaohua Zhao  (was: Yaohua Cui)

> Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
> 
>
> Key: SPARK-38767
> URL: https://issues.apache.org/jira/browse/SPARK-38767
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38792) Regression in time executor takes to do work sometime after v3.0.1 ?

2022-04-12 Thread Danny Guinther (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521342#comment-17521342
 ] 

Danny Guinther commented on SPARK-38792:


Where does org.apache.spark.sql.execution.collect.Collector live? I can't find 
it, and New Relic suggests that the problem may stem from some classes in 
org.apache.spark.sql.execution.collect.*

 

See attached screenshot named what-is-this-code.jpg

> Regression in time executor takes to do work sometime after v3.0.1 ?
> 
>
> Key: SPARK-38792
> URL: https://issues.apache.org/jira/browse/SPARK-38792
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Danny Guinther
>Priority: Major
> Attachments: dummy-job-job.jpg, dummy-job-query.png, 
> executor-timing-debug-number-2.jpg, executor-timing-debug-number-4.jpg, 
> executor-timing-debug-number-5.jpg, min-time-way-up.jpg, 
> what-s-up-with-exec-actions.jpg
>
>
> Hello!
> I'm sorry to trouble you with this, but I'm seeing a noticeable regression in 
> performance when upgrading from 3.0.1 to 3.2.1 and I can't pin down why. I 
> don't believe it is specific to my application since the upgrade from 3.0.1 to 
> 3.2.1 is purely a configuration change. I'd guess it presents itself in my 
> application due to the high volume of work my application does, but I could 
> be mistaken.
> The gist is that it seems like the executor actions I'm running suddenly 
> appear to take a lot longer on Spark 3.2.1. I don't have any ability to test 
> versions between 3.0.1 and 3.2.1 because my application was previously 
> blocked from upgrading beyond Spark 3.0.1 by 
> https://issues.apache.org/jira/browse/SPARK-37391 (which I helped to fix).
> Any ideas what might cause this or metrics I might try to gather to pinpoint 
> the problem? I've tried a bunch of the suggestions from 
> [https://spark.apache.org/docs/latest/tuning.html] to see if any of those 
> help, but none of the adjustments I've tried have been fruitful. I also tried 
> to look in [https://spark.apache.org/docs/latest/sql-migration-guide.html] 
> for ideas as to what might have changed to cause this behavior, but haven't 
> seen anything that sticks out as being a possible source of the problem.
> I have attached a graph that shows the drastic change in time taken by 
> executor actions. In the image the blue and purple lines are different kinds 
> of reads using the built-in JDBC data reader and the green line is writes 
> using a custom-built data writer. The deploy to switch from 3.0.1 to 3.2.1 
> occurred at 9AM on the graph. The graph data comes from timing blocks that 
> surround only the calls to dataframe actions, so there shouldn't be anything 
> specific to my application that is suddenly inflating these numbers. The 
> specific actions I'm invoking are: count() (but there's some transforming and 
> caching going on, so it's really more than that); first(); and write().
> The driver process does seem to be seeing more GC churn than with Spark 
> 3.0.1, but I don't think that explains this behavior. The executors don't 
> seem to have any problem with memory or GC and are not overutilized (our 
> pipeline is very read and write heavy, less heavy on transformations, so 
> executors tend to be idle while waiting for various network I/O).
>  
> Thanks in advance for any help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38792) Regression in time executor takes to do work sometime after v3.0.1 ?

2022-04-12 Thread Danny Guinther (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Guinther updated SPARK-38792:
---
Attachment: what-is-this-code.jpg

> Regression in time executor takes to do work sometime after v3.0.1 ?
> 
>
> Key: SPARK-38792
> URL: https://issues.apache.org/jira/browse/SPARK-38792
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Danny Guinther
>Priority: Major
> Attachments: dummy-job-job.jpg, dummy-job-query.png, 
> executor-timing-debug-number-2.jpg, executor-timing-debug-number-4.jpg, 
> executor-timing-debug-number-5.jpg, min-time-way-up.jpg, 
> what-is-this-code.jpg, what-s-up-with-exec-actions.jpg
>
>
> Hello!
> I'm sorry to trouble you with this, but I'm seeing a noticeable regression in 
> performance when upgrading from 3.0.1 to 3.2.1 and I can't pin down why. I 
> don't believe it is specific to my application since the upgrade from 3.0.1 to 
> 3.2.1 is purely a configuration change. I'd guess it presents itself in my 
> application due to the high volume of work my application does, but I could 
> be mistaken.
> The gist is that it seems like the executor actions I'm running suddenly 
> appear to take a lot longer on Spark 3.2.1. I don't have any ability to test 
> versions between 3.0.1 and 3.2.1 because my application was previously 
> blocked from upgrading beyond Spark 3.0.1 by 
> https://issues.apache.org/jira/browse/SPARK-37391 (which I helped to fix).
> Any ideas what might cause this or metrics I might try to gather to pinpoint 
> the problem? I've tried a bunch of the suggestions from 
> [https://spark.apache.org/docs/latest/tuning.html] to see if any of those 
> help, but none of the adjustments I've tried have been fruitful. I also tried 
> to look in [https://spark.apache.org/docs/latest/sql-migration-guide.html] 
> for ideas as to what might have changed to cause this behavior, but haven't 
> seen anything that sticks out as being a possible source of the problem.
> I have attached a graph that shows the drastic change in time taken by 
> executor actions. In the image the blue and purple lines are different kinds 
> of reads using the built-in JDBC data reader and the green line is writes 
> using a custom-built data writer. The deploy to switch from 3.0.1 to 3.2.1 
> occurred at 9AM on the graph. The graph data comes from timing blocks that 
> surround only the calls to dataframe actions, so there shouldn't be anything 
> specific to my application that is suddenly inflating these numbers. The 
> specific actions I'm invoking are: count() (but there's some transforming and 
> caching going on, so it's really more than that); first(); and write().
> The driver process does seem to be seeing more GC churn than with Spark 
> 3.0.1, but I don't think that explains this behavior. The executors don't 
> seem to have any problem with memory or GC and are not overutilized (our 
> pipeline is very read and write heavy, less heavy on transformations, so 
> executors tend to be idle while waiting for various network I/O).
>  
> Thanks in advance for any help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38882) The usage logger attachment logic should handle static methods properly.

2022-04-12 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-38882:
-

 Summary: The usage logger attachment logic should handle static 
methods properly.
 Key: SPARK-38882
 URL: https://issues.apache.org/jira/browse/SPARK-38882
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1, 3.3.0
Reporter: Takuya Ueshin


The usage logger attachment logic has an issue when handling static methods.

For example,

{code}
$ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger 
./bin/pyspark
{code}

{code:python}
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
>>> psdf.from_records([(1, 2), (3, 4)])
A function `DataFrame.from_records(data, index, exclude, columns, coerce_float, 
nrows)` was failed after 2007.430 ms: 0
Traceback (most recent call last):
...
{code}

without usage logger:

{code:python}
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
>>> psdf.from_records([(1, 2), (3, 4)])
   0  1
0  1  2
1  3  4
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38882) The usage logger attachment logic should handle static methods properly.

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521355#comment-17521355
 ] 

Apache Spark commented on SPARK-38882:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/36167

> The usage logger attachment logic should handle static methods properly.
> 
>
> Key: SPARK-38882
> URL: https://issues.apache.org/jira/browse/SPARK-38882
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> The usage logger attachment logic has an issue when handling static methods.
> For example,
> {code}
> $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger 
> ./bin/pyspark
> {code}
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
> A function `DataFrame.from_records(data, index, exclude, columns, 
> coerce_float, nrows)` was failed after 2007.430 ms: 0
> Traceback (most recent call last):
> ...
> {code}
> without usage logger:
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
>0  1
> 0  1  2
> 1  3  4
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38882) The usage logger attachment logic should handle static methods properly.

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38882:


Assignee: (was: Apache Spark)

> The usage logger attachment logic should handle static methods properly.
> 
>
> Key: SPARK-38882
> URL: https://issues.apache.org/jira/browse/SPARK-38882
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> The usage logger attachment logic has an issue when handling static methods.
> For example,
> {code}
> $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger 
> ./bin/pyspark
> {code}
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
> A function `DataFrame.from_records(data, index, exclude, columns, 
> coerce_float, nrows)` was failed after 2007.430 ms: 0
> Traceback (most recent call last):
> ...
> {code}
> without usage logger:
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
>0  1
> 0  1  2
> 1  3  4
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38882) The usage logger attachment logic should handle static methods properly.

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38882:


Assignee: Apache Spark

> The usage logger attachment logic should handle static methods properly.
> 
>
> Key: SPARK-38882
> URL: https://issues.apache.org/jira/browse/SPARK-38882
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> The usage logger attachment logic has an issue when handling static methods.
> For example,
> {code}
> $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger 
> ./bin/pyspark
> {code}
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
> A function `DataFrame.from_records(data, index, exclude, columns, 
> coerce_float, nrows)` was failed after 2007.430 ms: 0
> Traceback (most recent call last):
> ...
> {code}
> without usage logger:
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
>0  1
> 0  1  2
> 1  3  4
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38852) Better Data Source V2 operator pushdown framework

2022-04-12 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521364#comment-17521364
 ] 

Erik Krogen commented on SPARK-38852:
-

What's the relationship between this and SPARK-38788? Seems like they are 
laying out the same goal?

> Better Data Source V2 operator pushdown framework
> -
>
> Key: SPARK-38852
> URL: https://issues.apache.org/jira/browse/SPARK-38852
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> Currently, Spark supports pushing down Filters and Aggregates to data sources.
> However, the Data Source V2 operator pushdown framework has the following 
> shortcomings:
> # Only simple filter and aggregate are supported, which makes it impossible 
> to apply in most scenarios
> # The incompatibility of SQL syntax makes it impossible to apply in most 
> scenarios
> # Aggregate push down does not support multiple partitions of data sources
> # Spark's additional aggregate will cause some overhead
> # Limit push down is not supported
> # Top n push down is not supported
> # Offset push down is not supported



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38788) More comprehensive DSV2 push down capabilities

2022-04-12 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521363#comment-17521363
 ] 

Erik Krogen commented on SPARK-38788:
-

What's the relationship between this and SPARK-38852? Seems like they are 
laying out the same goal?

> More comprehensive DSV2 push down capabilities
> --
>
> Key: SPARK-38788
> URL: https://issues.apache.org/jira/browse/SPARK-38788
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Gather all tickets related to push down (filters) via Data Source V2 APIs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38882) The usage logger attachment logic should handle static methods properly.

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38882.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/36167

> The usage logger attachment logic should handle static methods properly.
> 
>
> Key: SPARK-38882
> URL: https://issues.apache.org/jira/browse/SPARK-38882
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> The usage logger attachment logic has an issue when handling static methods.
> For example,
> {code}
> $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger 
> ./bin/pyspark
> {code}
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
> A function `DataFrame.from_records(data, index, exclude, columns, 
> coerce_float, nrows)` was failed after 2007.430 ms: 0
> Traceback (most recent call last):
> ...
> {code}
> without usage logger:
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
>0  1
> 0  1  2
> 1  3  4
> {code}
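
For context, here is a minimal sketch (not Spark's actual usage-logging code; {{Demo}} and {{log_calls}} are made up) of why a naive attachment approach breaks static methods: storing the wrapper back as a plain function turns the static method into a regular method, so the instance ends up being passed as the first argument.

{code:python}
import functools

class Demo:
    @staticmethod
    def from_records(data):
        return list(data)

def log_calls(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with args={args!r}")
        return func(*args, **kwargs)
    return wrapper

original = Demo.__dict__["from_records"]   # the staticmethod object itself

# Naive attachment: Demo.from_records yields a plain function, and assigning the
# wrapper back as a plain function makes instance access bind it like a normal
# method, so the instance is passed as `data` and the call fails.
Demo.from_records = log_calls(Demo.from_records)
try:
    Demo().from_records([1, 2])
except TypeError as e:
    print("broken:", e)

# Handling the staticmethod case explicitly keeps instance calls working.
Demo.from_records = staticmethod(log_calls(original.__func__))
print(Demo().from_records([1, 2]))         # [1, 2]
{code}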



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38882) The usage logger attachment logic should handle static methods properly.

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38882:
-
Fix Version/s: 3.3.0

> The usage logger attachment logic should handle static methods properly.
> 
>
> Key: SPARK-38882
> URL: https://issues.apache.org/jira/browse/SPARK-38882
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1, 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>
> The usage logger attachment logic has an issue when handling static methods.
> For example,
> {code}
> $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger 
> ./bin/pyspark
> {code}
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
> A function `DataFrame.from_records(data, index, exclude, columns, 
> coerce_float, nrows)` was failed after 2007.430 ms: 0
> Traceback (most recent call last):
> ...
> {code}
> without usage logger:
> {code:python}
> >>> import pyspark.pandas as ps
> >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]})
> >>> psdf.from_records([(1, 2), (3, 4)])
>0  1
> 0  1  2
> 1  3  4
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38878) CLONE - Improve the test coverage for pyspark/statcounter.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38878.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/statcounter.py
> 
>
> Key: SPARK-38878
> URL: https://issues.apache.org/jira/browse/SPARK-38878
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of statcounter.py 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38876) CLONE - Improve the test coverage for pyspark/*.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38876.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/*.py
> --
>
> Key: SPARK-38876
> URL: https://issues.apache.org/jira/browse/SPARK-38876
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, there are several Python scripts under the pyspark/ directory (e.g. 
> rdd.py, util.py, serializers.py, ...).
> We could improve the test coverage by adding the missing tests for these 
> scripts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38877) CLONE - Improve the test coverage for pyspark/find_spark_home.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38877.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/find_spark_home.py
> 
>
> Key: SPARK-38877
> URL: https://issues.apache.org/jira/browse/SPARK-38877
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> We should test when the environment variables are not set 
> (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38875) CLONE - Improve the test coverage for pyspark/sql module

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38875.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/sql module
> 
>
> Key: SPARK-38875
> URL: https://issues.apache.org/jira/browse/SPARK-38875
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, the sql module has 90% test coverage.
> We could improve it by adding the missing tests for the sql module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38872) CLONE - Improve the test coverage for pyspark/pandas module

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38872.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/pandas module
> ---
>
> Key: SPARK-38872
> URL: https://issues.apache.org/jira/browse/SPARK-38872
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, the pandas module (pandas API on Spark) has 94% test coverage.
> We could improve it by adding the missing tests for the pandas module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38873) CLONE - Improve the test coverage for pyspark/mllib module

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38873.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/mllib module
> --
>
> Key: SPARK-38873
> URL: https://issues.apache.org/jira/browse/SPARK-38873
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, the mllib module has 88% test coverage.
> We could improve it by adding the missing tests for the mllib module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38874) CLONE - Improve the test coverage for pyspark/ml module

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38874.
--
Resolution: Invalid

> CLONE - Improve the test coverage for pyspark/ml module
> ---
>
> Key: SPARK-38874
> URL: https://issues.apache.org/jira/browse/SPARK-38874
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Major
>
> Currently, the ml module has 90% test coverage.
> We could improve it by adding the missing tests for the ml module.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38879:


Assignee: (was: Hyukjin Kwon)

> Improve the test coverage for pyspark/rddsampler.py
> ---
>
> Key: SPARK-38879
> URL: https://issues.apache.org/jira/browse/SPARK-38879
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of rddsampler.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521398#comment-17521398
 ] 

Hyukjin Kwon commented on SPARK-38879:
--

[~pralabhkumar] please just go ahead. no need to ask :-).

> Improve the test coverage for pyspark/rddsampler.py
> ---
>
> Key: SPARK-38879
> URL: https://issues.apache.org/jira/browse/SPARK-38879
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of rddsampler.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38854:


Assignee: pralabhkumar  (was: Hyukjin Kwon)

> Improve the test coverage for pyspark/statcounter.py
> 
>
> Key: SPARK-38854
> URL: https://issues.apache.org/jira/browse/SPARK-38854
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> Improve the test coverage of statcounter.py 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38879:
-
Fix Version/s: (was: 3.4.0)

> Improve the test coverage for pyspark/rddsampler.py
> ---
>
> Key: SPARK-38879
> URL: https://issues.apache.org/jira/browse/SPARK-38879
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Priority: Minor
>
> Improve the test coverage of rddsampler.py



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38822) Raise indexError when insert loc is out of bounds

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38822:


Assignee: Yikun Jiang

> Raise indexError when insert loc is out of bounds
> -
>
> Key: SPARK-38822
> URL: https://issues.apache.org/jira/browse/SPARK-38822
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>
>  
> [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179]
>  
> We need to raise IndexError when the insert location is out of bounds for the 
> axis, and also update the test case.
>  
>  - Related changes:
> - pandas 1.4.0+ uses numpy insert: 
> https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875
> - Since numpy 1.8 (10 years ago, 
> https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc): 
> [`out-of-bound indices will generate an 
> error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes)
>  
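
A small sketch of the behavior this targets (values made up; the exact error message may differ across versions): {{Index.insert}} in pandas API on Spark should reject an out-of-bounds location the same way pandas 1.4+/numpy do.

{code:python}
import pyspark.pandas as ps

psidx = ps.Index([1, 2, 3])

# In-bounds insert keeps working as before.
print(psidx.insert(1, 100))

# An out-of-bounds loc should raise IndexError (matching pandas 1.4+ / numpy),
# e.g. roughly "index 100 is out of bounds for axis 0 with size 3".
try:
    psidx.insert(100, 100)
except IndexError as e:
    print(e)
{code}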



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38822) Raise indexError when insert loc is out of bounds

2022-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38822.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36115
[https://github.com/apache/spark/pull/36115]

> Raise indexError when insert loc is out of bounds
> -
>
> Key: SPARK-38822
> URL: https://issues.apache.org/jira/browse/SPARK-38822
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179]
>  
> We need to raise IndexError when the insert location is out of bounds for the 
> axis, and also update the test case.
>  
>  - Related changes:
> - pandas 1.4.0+ uses numpy insert: 
> https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875
> - Since numpy 1.8 (10 years ago, 
> https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc): 
> [`out-of-bound indices will generate an 
> error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes)
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38822) Raise indexError when insert loc is out of bounds

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521411#comment-17521411
 ] 

Apache Spark commented on SPARK-38822:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36168

> Raise indexError when insert loc is out of bounds
> -
>
> Key: SPARK-38822
> URL: https://issues.apache.org/jira/browse/SPARK-38822
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179]
>  
> We need to raise IndexError when the insert location is out of bounds for the 
> axis, and also update the test case.
>  
>  - Related changes:
> - pandas 1.4.0+ uses numpy insert: 
> https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875
> - Since numpy 1.8 (10 years ago, 
> https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc): 
> [`out-of-bound indices will generate an 
> error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes)
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38822) Raise indexError when insert loc is out of bounds

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521412#comment-17521412
 ] 

Apache Spark commented on SPARK-38822:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36168

> Raise indexError when insert loc is out of bounds
> -
>
> Key: SPARK-38822
> URL: https://issues.apache.org/jira/browse/SPARK-38822
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
>  
> [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179]
>  
> We need to raise IndexError when the insert location is out of bounds for the 
> axis, and also update the test case.
>  
>  - Related changes:
> - pandas 1.4.0+ uses numpy insert: 
> https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875
> - Since numpy 1.8 (10 years ago, 
> https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc): 
> [`out-of-bound indices will generate an 
> error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes)
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-12 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521414#comment-17521414
 ] 

Bruce Robbins commented on SPARK-38823:
---

This appears to be an optimization bug that results in corruption of the 
buffers in {{AggregationIterator}}.

On master and 3.3, {{NewInstance}} with no arguments is considered foldable. As 
a result, the {{ConstantFolding}} rule turns NewInstance into a Literal holding 
an instance of the user's specified Java bean. The instance becomes a singleton 
that gets reused for each input record (although its fields get updated by 
{{InitializeJavaBean}}).

Because the instance gets reused, sometimes multiple buffers in 
{{AggregationIterator}} are actually referring to the same Java bean instance.

Take, for example, the test I added 
[here|https://github.com/bersprockets/spark/blob/17a8ad64f5bc39cb26d25b63f3692e7b8632baf8/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L560].

The input is:
{noformat}
List items = Arrays.asList(
new Item("a", 1),
new Item("b", 3),
new Item("c", 2),
new Item("a", 7));
{noformat}
As {{ObjectAggregationIterator}} reads the input, the buffers get set up as 
follows (note that the first field of Item should be the same as the key):
{noformat}
- Read Item("a", 1)

- Buffers are now:
  Key "a" --> Item("a", 1)

- Read Item("b", 3)

- Buffers are now:
  Key "a" -> Item("b", 3)
  Key "b" -> Item("b", 3)
{noformat}
The buffer for key "a" now contains Item("b", 3). That's because both buffers 
contain a reference to the same Item instance, and that Item instance's fields 
were updated when {{Item("b", 3)}} was read.

When {{AggregationIterator}} finally calls the test's reduce function, it will 
pass the same Item instance ({{Item("a", 7)}}) as both the buffer and the input 
record. At that point, the buffers for "a", "b", and "c" will all contain 
{{Item("a", 7)}}.

I _think_ the fix for this is to make {{NewInstance}} non-foldable. My linked 
test passes with that change (and fails without it). I will run the unit tests 
and hopefully make a PR tomorrow, assuming the proposed fix doesn't break 
something else besides {{ConstantFoldingSuite}}.
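
The corruption itself is ordinary mutable-object aliasing. A tiny Python sketch of the same effect (nothing Spark-specific; it just mirrors how every buffer ends up referencing the one reused instance):

{code:python}
# Each record overwrites the fields of a single shared object, and the per-key
# buffers only store references to it -- so every buffer ends up showing the
# last record read, just like the AggregationIterator buffers described above.
singleton = {"name": None, "count": None}
buffers = {}

for name, count in [("a", 1), ("b", 3), ("c", 2), ("a", 7)]:
    singleton["name"] = name       # fields updated in place
    singleton["count"] = count
    if name not in buffers:
        buffers[name] = singleton  # a reference, not a copy

print(buffers)
# {'a': {'name': 'a', 'count': 7}, 'b': {'name': 'a', 'count': 7},
#  'c': {'name': 'a', 'count': 7}}
{code}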


> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.4.0
>Reporter: IKozar
>Priority: Major
>
> {code:java}
>   @Data
>   @NoArgsConstructor
>   @AllArgsConstructor
>   public static class Item implements Serializable {
> private String x;
> private String y;
> private int z;
> public Item addZ(int z) {
>   return new Item(x, y, this.z + z);
> }
>   } {code}
> {code:java}
> List<Item> items = List.of(
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X3", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X1", "Y2", 1),
>  new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, 
> Item.class).as(Encoders.bean(Item.class));
> ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> 
> Tuple2.apply(item.getX(), item.getY()),
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>  .reduceGroups((ReduceFunction<Item>) (item1, item2) -> 
>   item1.addZ(item2.getZ()))
>  .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+
> {noformat}
> Note that the key does not match the value.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-12 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-38823:
--
Labels: correctness  (was: )

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.4.0
>Reporter: IKozar
>Priority: Major
>  Labels: correctness
>
> {code:java}
>   @Data
>   @NoArgsConstructor
>   @AllArgsConstructor
>   public static class Item implements Serializable {
> private String x;
> private String y;
> private int z;
> public Item addZ(int z) {
>   return new Item(x, y, this.z + z);
> }
>   } {code}
> {code:java}
> List<Item> items = List.of(
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X3", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X1", "Y2", 1),
>  new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, 
> Item.class).as(Encoders.bean(Item.class));
> ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> 
> Tuple2.apply(item.getX(), item.getY()),
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>  .reduceGroups((ReduceFunction<Item>) (item1, item2) -> 
>   item1.addZ(item2.getZ()))
>  .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+
> {noformat}
> Note that the key does not match the value.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38823) Incorrect result of dataset reduceGroups in java

2022-04-12 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-38823:
--
Affects Version/s: 3.3.0

> Incorrect result of dataset reduceGroups in java
> 
>
> Key: SPARK-38823
> URL: https://issues.apache.org/jira/browse/SPARK-38823
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.3.0, 3.4.0
>Reporter: IKozar
>Priority: Major
>  Labels: correctness
>
> {code:java}
>   @Data
>   @NoArgsConstructor
>   @AllArgsConstructor
>   public static class Item implements Serializable {
> private String x;
> private String y;
> private int z;
> public Item addZ(int z) {
>   return new Item(x, y, this.z + z);
> }
>   } {code}
> {code:java}
> List<Item> items = List.of(
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X2", "Y1", 1),
>  new Item("X3", "Y1", 1),
>  new Item("X1", "Y1", 1),
>  new Item("X1", "Y2", 1),
>  new Item("X2", "Y1", 1));
> Dataset<Item> ds = spark.createDataFrame(items, 
> Item.class).as(Encoders.bean(Item.class));
> ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> 
> Tuple2.apply(item.getX(), item.getY()),
> Encoders.tuple(Encoders.STRING(), Encoders.STRING()))
>  .reduceGroups((ReduceFunction<Item>) (item1, item2) -> 
>   item1.addZ(item2.getZ()))
>  .show(10);
> {code}
> result is
> {noformat}
> +--------+----------------------------------------------+
> |     key|ReduceAggregator(poc.job.JavaSparkReduce$Item)|
> +--------+----------------------------------------------+
> |{X1, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X2, Y1}|                                   {X2, Y1, 2}| -- expected 3
> |{X1, Y2}|                                   {X2, Y1, 1}|
> |{X3, Y1}|                                   {X2, Y1, 1}|
> +--------+----------------------------------------------+
> {noformat}
> Note that the key does not match the value.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38883) smaller pyspark install if not using streaming?

2022-04-12 Thread t oo (Jira)
t oo created SPARK-38883:


 Summary: smaller pyspark install if not using streaming?
 Key: SPARK-38883
 URL: https://issues.apache.org/jira/browse/SPARK-38883
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.2.1
Reporter: t oo


h3. Describe the feature

I am trying to include pyspark in my Docker image, but the install is around 300MB.

The largest jar is rocksdbjni-6.20.3.jar at 35MB.

Is it safe to remove this jar if I have no need for Spark Streaming?

Is there any advice on getting the install smaller? Perhaps a map of which jars 
are needed for batch vs sql vs streaming?
h3. Use Case

A smaller Python package means I can pack more concurrent pods onto my EKS 
workers.
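
One way to see where the size actually goes before deciding what to strip (a generic sketch, not an official recommendation; the path is simply wherever {{pyspark}} is installed in the image):

{code:python}
import os
from pathlib import Path

import pyspark

# List the largest jars bundled with the pip-installed pyspark, biggest first.
jars_dir = Path(pyspark.__file__).parent / "jars"
jars = sorted(jars_dir.glob("*.jar"), key=os.path.getsize, reverse=True)
for jar in jars[:15]:
    print(f"{os.path.getsize(jar) / (1024 * 1024):6.1f} MB  {jar.name}")
{code}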



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38721:


Assignee: (was: Apache Spark)

> Test the error class: CANNOT_PARSE_DECIMAL
> --
>
> Key: SPARK-38721
> URL: https://issues.apache.org/jira/browse/SPARK-38721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def cannotParseDecimalError(): Throwable = {
> new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL",
>   messageParameters = Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL

2022-04-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38721:


Assignee: Apache Spark

> Test the error class: CANNOT_PARSE_DECIMAL
> --
>
> Key: SPARK-38721
> URL: https://issues.apache.org/jira/browse/SPARK-38721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def cannotParseDecimalError(): Throwable = {
> new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL",
>   messageParameters = Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521422#comment-17521422
 ] 

Apache Spark commented on SPARK-38721:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36169

> Test the error class: CANNOT_PARSE_DECIMAL
> --
>
> Key: SPARK-38721
> URL: https://issues.apache.org/jira/browse/SPARK-38721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def cannotParseDecimalError(): Throwable = {
> new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL",
>   messageParameters = Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38804) Add StreamingQueryManager.removeListener in PySpark

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521423#comment-17521423
 ] 

Apache Spark commented on SPARK-38804:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36170

> Add StreamingQueryManager.removeListener in PySpark
> ---
>
> Key: SPARK-38804
> URL: https://issues.apache.org/jira/browse/SPARK-38804
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-38759 added StreamingQueryManager.addListener. We should add 
> removeListener as well.
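
Sketch of the expected usage once removeListener exists, assuming the Python API mirrors the Scala side (treat the exact class and method names as illustrative, per the linked PR):

{code:python}
from pyspark.sql.streaming import StreamingQueryListener

class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("started:", event.id)
    def onQueryProgress(self, event):
        pass
    def onQueryTerminated(self, event):
        print("terminated:", event.id)

listener = MyListener()
spark.streams.addListener(listener)      # added by SPARK-38759
# ... run streaming queries ...
spark.streams.removeListener(listener)   # what this ticket adds
{code}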



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521425#comment-17521425
 ] 

Apache Spark commented on SPARK-38721:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36169

> Test the error class: CANNOT_PARSE_DECIMAL
> --
>
> Key: SPARK-38721
> URL: https://issues.apache.org/jira/browse/SPARK-38721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def cannotParseDecimalError(): Throwable = {
> new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL",
>   messageParameters = Array.empty)
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38804) Add StreamingQueryManager.removeListener in PySpark

2022-04-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521427#comment-17521427
 ] 

Apache Spark commented on SPARK-38804:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36170

> Add StreamingQueryManager.removeListener in PySpark
> ---
>
> Key: SPARK-38804
> URL: https://issues.apache.org/jira/browse/SPARK-38804
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-38759 added StreamingQueryManager.addListener. We should add 
> removeListener as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


