[jira] [Updated] (SPARK-38857) test_mode test failed due to 1.4.1-1.4.3 bug
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-38857: Component/s: (was: Tests) > test_mode test failed due to 1.4.1-1.4.3 bug > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
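For context, here is a minimal sketch of the behavior change this issue tracks. It is not taken from the ticket: it assumes pandas >= 1.4.1 and a working PySpark installation, and the variable names are illustrative.

{code:python}
# Minimal sketch, assuming pandas >= 1.4.1 and a working PySpark installation.
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series([1, 2, 2, 3], name="x")
print(pser.mode())   # per the linked pandas issue, the result now keeps name="x"

psser = ps.from_pandas(pser)
print(psser.mode())  # pandas-on-Spark should preserve the name the same way
{code}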
[jira] [Updated] (SPARK-38857) series name should be preserved in pser.mode()
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-38857: Summary: series name should be preserved in pser.mode() (was: test_mode test failed due to 1.4.1-1.4.3 bug) > series name should be preserved in pser.mode() > -- > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38857) series name should be preserved in series.mode()
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-38857: Summary: series name should be preserved in series.mode() (was: series name should be preserved in pser.mode()) > series name should be preserved in series.mode() > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38857) test_mode test failed due to 1.4.1-1.4.3 bug
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-38857: Description: [https://github.com/pandas-dev/pandas/issues/46737] We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an issue. update: The Pandas community confirmed it's an unexpected but correct change, so the series name should be preserved in pser.mode(). was: [https://github.com/pandas-dev/pandas/issues/46737] We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an issue. > test_mode test failed due to 1.4.1-1.4.3 bug > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series name > should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38857) test_mode test failed due to 1.4.1-1.4.3 bug
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-38857: Description: [https://github.com/pandas-dev/pandas/issues/46737] We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an issue. update: The Pandas community confirmed it's an unexpected but correct change, so the series name should be preserved in pser.mode(). was: [https://github.com/pandas-dev/pandas/issues/46737] We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's a issue. update: Pandas community confirm it's a unexpected but right changes, so series name should be preserved in pser.mode(). > test_mode test failed due to 1.4.1-1.4.3 bug > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY
[ https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521003#comment-17521003 ] panbingkun commented on SPARK-38725: I am working on this. Thanks [~maxgekk] > Test the error class: DUPLICATE_KEY > --- > > Key: SPARK-38725 > URL: https://issues.apache.org/jira/browse/SPARK-38725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class *DUPLICATE_KEY* to > QueryParsingErrorsSuite. The test should cover the exception thrown in > QueryParsingErrors: > {code:scala} > def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = { > // Found duplicate keys '$key' > new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = > Array(key), ctx) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38868) `assert_true` fails unconditionally after `left_outer` joins
Fabien Dubosson created SPARK-38868: --- Summary: `assert_true` fails unconditionally after `left_outer` joins Key: SPARK-38868 URL: https://issues.apache.org/jira/browse/SPARK-38868 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1 Reporter: Fabien Dubosson When `assert_true` is used after a `left_outer` join, the assert exception is raised even though all the rows meet the condition. Using an `inner` join does not expose this issue. {code:java} from pyspark.sql import SparkSession from pyspark.sql import functions as sf session = SparkSession.builder.getOrCreate() entries = session.createDataFrame( [ ("a", 1), ("b", 2), ("c", 3), ], ["id", "outcome_id"], ) outcomes = session.createDataFrame( [ (1, 12), (2, 34), (3, 32), ], ["outcome_id", "outcome_value"], ) # Inner join works as expected ( entries.join(outcomes, on="outcome_id", how="inner") .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10)) .filter(sf.col("valid").isNull()) .show() ) # Left join fails with «'('outcome_value > 10)' is not true!» even though it is the case ( entries.join(outcomes, on="outcome_id", how="left_outer") .withColumn("valid", sf.assert_true(sf.col("outcome_value") > 10)) .filter(sf.col("valid").isNull()) .show() ){code} Reproduced on `pyspark` versions: `3.2.1`, `3.2.0`, `3.1.2` and `3.1.1`. I am not sure if "native" Spark exposes this issue as well or not; I don't have the knowledge/setup to test that. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
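As a side note, one way to express the same validation without relying on `assert_true` inside the plan is to count violating rows on the driver. This is only a hedged sketch reusing the `entries`/`outcomes` DataFrames and the `sf` alias from the reproduction above; it is not a confirmed workaround for the underlying bug.

{code:python}
# Hedged alternative-validation sketch, reusing `entries`, `outcomes` and `sf` from
# the reproduction above. Rows where the comparison is NULL (unmatched left rows)
# are deliberately not counted as violations here.
joined = entries.join(outcomes, on="outcome_id", how="left_outer")
violations = joined.filter(~(sf.col("outcome_value") > 10)).count()
if violations > 0:
    raise ValueError(f"{violations} rows do not satisfy outcome_value > 10")
joined.show()
{code}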
[jira] [Commented] (SPARK-38857) series name should be preserved in series.mode()
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521017#comment-17521017 ] Apache Spark commented on SPARK-38857: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/36159 > series name should be preserved in series.mode() > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38857) series name should be preserved in series.mode()
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38857: Assignee: (was: Apache Spark) > series name should be preserved in series.mode() > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38857) series name should be preserved in series.mode()
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38857: Assignee: Apache Spark > series name should be preserved in series.mode() > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38857) series name should be preserved in series.mode()
[ https://issues.apache.org/jira/browse/SPARK-38857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521018#comment-17521018 ] Apache Spark commented on SPARK-38857: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/36159 > series name should be preserved in series.mode() > > > Key: SPARK-38857 > URL: https://issues.apache.org/jira/browse/SPARK-38857 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/pandas-dev/pandas/issues/46737] > We might want to skip this test in 1.4.1-1.4.3 after pandas confirmed it's an > issue. > > update: > The Pandas community confirmed it's an unexpected but correct change, so the series > name should be preserved in pser.mode(). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38821) test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug
[ https://issues.apache.org/jira/browse/SPARK-38821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-38821: Summary: test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug (was: test_nsmallest test failed due to pandas 1.4.1/1.4.2 bug) > test_nsmallest test failed due to pandas 1.4.0-1.4.2 bug > > > Key: SPARK-38821 > URL: https://issues.apache.org/jira/browse/SPARK-38821 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/test_dataframe.py#L1829] > > After [https://github.com/pandas-dev/pandas/issues/46589] is fixed, we need to skip > L1829 from v1.4.0 to v1.4.x -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
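For reference, the kind of pandas-version guard this ticket describes could look roughly like the sketch below. The class and test names are hypothetical, not the actual code in test_dataframe.py, and the version range follows the updated summary (1.4.0-1.4.2).

{code:python}
# Illustrative sketch of a pandas-version-guarded skip; names are hypothetical.
import unittest
from distutils.version import LooseVersion

import pandas as pd

# pandas 1.4.0-1.4.2 are affected per https://github.com/pandas-dev/pandas/issues/46589
affected = (
    LooseVersion("1.4.0") <= LooseVersion(pd.__version__) < LooseVersion("1.4.3")
)

class DataFrameNsmallestTest(unittest.TestCase):
    @unittest.skipIf(affected, "nsmallest is broken on pandas 1.4.0-1.4.2")
    def test_nsmallest(self):
        ...  # the real assertions live in python/pyspark/pandas/tests/test_dataframe.py
{code}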
[jira] [Commented] (SPARK-37935) Migrate onto error classes
[ https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521024#comment-17521024 ] Apache Spark commented on SPARK-37935: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36160 > Migrate onto error classes > -- > > Key: SPARK-37935 > URL: https://issues.apache.org/jira/browse/SPARK-37935 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32850 introduced error classes as > a part of the error messages framework > (https://issues.apache.org/jira/browse/SPARK-33539). Need to migrate all > exceptions from QueryExecutionErrors, QueryCompilationErrors and > QueryParsingErrors onto the error classes using instances of SparkThrowable, > and carefully test every error class by writing tests in dedicated test > suites: > * QueryExecutionErrorsSuite for the errors that occur during query > execution > * QueryCompilationErrorsSuite ... query compilation or eagerly executing > commands > * QueryParsingErrorsSuite ... parsing errors > Here is an example https://github.com/apache/spark/pull/35157 of how an > existing Java exception can be replaced, and testing of related error > classes. At the end, we should migrate all exceptions from the files > Query.*Errors.scala and cover all error classes from the error-classes.json > file by tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY
[ https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521026#comment-17521026 ] Apache Spark commented on SPARK-38725: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36160 > Test the error class: DUPLICATE_KEY > --- > > Key: SPARK-38725 > URL: https://issues.apache.org/jira/browse/SPARK-38725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class *DUPLICATE_KEY* to > QueryParsingErrorsSuite. The test should cover the exception thrown in > QueryParsingErrors: > {code:scala} > def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = { > // Found duplicate keys '$key' > new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = > Array(key), ctx) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38725) Test the error class: DUPLICATE_KEY
[ https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521027#comment-17521027 ] Apache Spark commented on SPARK-38725: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36160 > Test the error class: DUPLICATE_KEY > --- > > Key: SPARK-38725 > URL: https://issues.apache.org/jira/browse/SPARK-38725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class *DUPLICATE_KEY* to > QueryParsingErrorsSuite. The test should cover the exception thrown in > QueryParsingErrors: > {code:scala} > def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = { > // Found duplicate keys '$key' > new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = > Array(key), ctx) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38725) Test the error class: DUPLICATE_KEY
[ https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38725: Assignee: Apache Spark > Test the error class: DUPLICATE_KEY > --- > > Key: SPARK-38725 > URL: https://issues.apache.org/jira/browse/SPARK-38725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add at least one test for the error class *DUPLICATE_KEY* to > QueryParsingErrorsSuite. The test should cover the exception thrown in > QueryParsingErrors: > {code:scala} > def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = { > // Found duplicate keys '$key' > new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = > Array(key), ctx) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37935) Migrate onto error classes
[ https://issues.apache.org/jira/browse/SPARK-37935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521025#comment-17521025 ] Apache Spark commented on SPARK-37935: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36160 > Migrate onto error classes > -- > > Key: SPARK-37935 > URL: https://issues.apache.org/jira/browse/SPARK-37935 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The PR https://github.com/apache/spark/pull/32850 introduced error classes as > a part of the error messages framework > (https://issues.apache.org/jira/browse/SPARK-33539). Need to migrate all > exceptions from QueryExecutionErrors, QueryCompilationErrors and > QueryParsingErrors onto the error classes using instances of SparkThrowable, > and carefully test every error class by writing tests in dedicated test > suites: > * QueryExecutionErrorsSuite for the errors that occur during query > execution > * QueryCompilationErrorsSuite ... query compilation or eagerly executing > commands > * QueryParsingErrorsSuite ... parsing errors > Here is an example https://github.com/apache/spark/pull/35157 of how an > existing Java exception can be replaced, and testing of related error > classes. At the end, we should migrate all exceptions from the files > Query.*Errors.scala and cover all error classes from the error-classes.json > file by tests. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38725) Test the error class: DUPLICATE_KEY
[ https://issues.apache.org/jira/browse/SPARK-38725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38725: Assignee: (was: Apache Spark) > Test the error class: DUPLICATE_KEY > --- > > Key: SPARK-38725 > URL: https://issues.apache.org/jira/browse/SPARK-38725 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class *DUPLICATE_KEY* to > QueryParsingErrorsSuite. The test should cover the exception thrown in > QueryParsingErrors: > {code:scala} > def duplicateKeysError(key: String, ctx: ParserRuleContext): Throwable = { > // Found duplicate keys '$key' > new ParseException(errorClass = "DUPLICATE_KEY", messageParameters = > Array(key), ctx) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38869) Respect Table capability `ACCEPT_ANY_SCHEMA` in default column resolution
Gengliang Wang created SPARK-38869: -- Summary: Respect Table capability `ACCEPT_ANY_SCHEMA` in default column resolution Key: SPARK-38869 URL: https://issues.apache.org/jira/browse/SPARK-38869 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Daniel If a V2 table has the capability of [ACCEPT_ANY_SCHEMA|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/TableCapability.java#L94], we should skip adding default column values to the insert schema. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python
Furcy Pin created SPARK-38870: - Summary: SparkSession.builder returns a new builder in Scala, but not in Python Key: SPARK-38870 URL: https://issues.apache.org/jira/browse/SPARK-38870 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.2.1 Reporter: Furcy Pin In pyspark, _SparkSession.builder_ always returns the same static builder, while the expected behaviour should be the same as in Scala, where it returns a new builder each time. *How to reproduce* When we run the following code in Scala: {code:java} import org.apache.spark.sql.SparkSession val s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() println("A : " + s1.conf.get("key")) // value s1.conf.set("key", "new_value") println("B : " + s1.conf.get("key")) // new_value val s2 = SparkSession.builder.getOrCreate() println("C : " + s1.conf.get("key")) // new_value{code} The output is: {code:java} A : value B : new_value C : new_value <<<{code} But when we run the following (supposedly equivalent) code in Python: {code:java} from pyspark.sql import SparkSession s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() print("A : " + s1.conf.get("key")) s1.conf.set("key", "new_value") print("B : " + s1.conf.get("key")) s2 = SparkSession.builder.getOrCreate() print("C : " + s1.conf.get("key")){code} The output is: {code:java} A : value B : new_value C : value <<< {code} *Root cause analysis* This comes from the fact that _SparkSession.builder_ behaves differently in Python than in Scala. In Scala, it returns a *new builder* each time; in Python it returns *the same builder* every time, and the SparkSession.Builder._options are static, too. Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time, and override the options that were set afterwards. This leads to very awkward behavior in every Spark version up to and including 3.2.1. {*}Example{*}: This example crashes, but was fixed by SPARK-37638 {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK from pyspark.sql import functions as f from pyspark.sql.types import StringType f.col("a").cast(StringType()) assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # This fails in all versions until the SPARK-37638 fix # because before that fix, Column.cast() called SparkSession.builder.getOrCreate(){code} But this example still crashes in the current version on the master branch {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK SparkSession.builder.getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # This assert fails in master{code} I will make a Pull Request to fix this bug shortly. 
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
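To illustrate the direction of a fix: the shared-state problem goes away if `builder` is exposed as a class-level property that returns a fresh builder on every access, as in Scala. The snippet below is only a hedged, self-contained sketch of that pattern using a stand-in class; it is not the actual patch from https://github.com/apache/spark/pull/36161, whose details may differ.

{code:python}
# Hedged sketch of the "new builder per access" pattern with a stand-in class;
# not the actual change proposed in https://github.com/apache/spark/pull/36161.

class classproperty(property):
    """Make a property readable on the class itself."""
    def __get__(self, instance, owner=None):
        return self.fget(owner)

class Session:  # stand-in for pyspark.sql.SparkSession
    class Builder:
        def __init__(self):
            self._options = {}  # per-instance state, not shared between builders

        def config(self, key, value):
            self._options[key] = value
            return self

    @classproperty
    def builder(cls):
        return cls.Builder()  # a fresh builder on every access, like Scala

b1 = Session.builder.config("key", "value")
b2 = Session.builder
assert b1._options == {"key": "value"}
assert b2._options == {}  # options set on b1 do not leak into b2
{code}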
[jira] [Updated] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python
[ https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Furcy Pin updated SPARK-38870: -- Description: In pyspark, _SparkSession.builder_ always returns the same static builder, while the expected behaviour should be the same as in Scala, where it returns a new builder each time. *How to reproduce* When we run the following code in Scala : {code:java} import org.apache.spark.sql.SparkSession val s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() println("A : " + s1.conf.get("key")) // value s1.conf.set("key", "new_value") println("B : " + s1.conf.get("key")) // new_value val s2 = SparkSession.builder.getOrCreate() println("C : " + s1.conf.get("key")) // new_value{code} The output is : {code:java} A : value B : new_value C : new_value <<<{code} But when we run the following (supposedly equivalent) code in Python: {code:java} from pyspark.sql import SparkSession s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() print("A : " + s1.conf.get("key")) s1.conf.set("key", "new_value") print("B : " + s1.conf.get("key")) s2 = SparkSession.builder.getOrCreate() print("C : " + s1.conf.get("key")){code} The output is : {code:java} A : value B : new_value C : value <<< {code} *Root cause analysis* This comes from the fact that _SparkSession.builder_ behaves differently in Python than in Scala. In Scala, it returns a *new builder* each time, in Python it returns *the same builder* every time, and the SparkSession.Builder._options are static, too. Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time, and overrides the option that were set afterwards. This leads to very awkward behavior in every Spark version up to 3.2.1 included {*}Example{*}: This example crashes, but was fixed by SPARK-37638 {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK from pyspark.sql import functions as f from pyspark.sql.types import StringType f.col("a").cast(StringType()) assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # This fails in all versions until the SPARK-37638 fix # because before that fix, Column.cast() calle SparkSession.builder.getOrCreate(){code} But this example still crashes in the current version on the master branch {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK SparkSession.builder.getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # This assert fails in master{code} I will make a Pull Request to fix this bug shortly. was: In pyspark, _SparkSession.builder_ always returns the same static builder, while the expected behaviour should be the same as in Scala, where it returns a new builder each time. 
*How to reproduce* When we run the following code in Scala : {code:java} import org.apache.spark.sql.SparkSession val s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() println("A : " + s1.conf.get("key")) // value s1.conf.set("key", "new_value") println("B : " + s1.conf.get("key")) // new_value val s2 = SparkSession.builder.getOrCreate() println("C : " + s1.conf.get("key")) // new_value{code} The output is : {code:java} A : value B : new_value C : new_value <<<{code} But when we run the following (supposedly equivalent) code in Python: {code:java} from pyspark.sql import SparkSession s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() print("A : " + s1.conf.get("key")) s1.conf.set("key", "new_value") print("B : " + s1.conf.get("key")) s2 = SparkSession.builder.getOrCreate() print("C : " + s1.conf.get("key")){code} The output is : {code:java} A : value B : new_value C : value <<< {code} *Root cause analysis* This comes from the fact that _SparkSession.builder_ behaves differently in Python than in Scala. In Scala, it returns a *new builder* each time, in Python it returns *the same builder* every time, and the SparkSession.Builder._options are static, too. Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options
[jira] [Updated] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python
[ https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Furcy Pin updated SPARK-38870: -- Description: In pyspark, _SparkSession.builder_ always returns the same static builder, while the expected behaviour should be the same as in Scala, where it returns a new builder each time. *How to reproduce* When we run the following code in Scala : {code:java} import org.apache.spark.sql.SparkSession val s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() println("A : " + s1.conf.get("key")) // value s1.conf.set("key", "new_value") println("B : " + s1.conf.get("key")) // new_value val s2 = SparkSession.builder.getOrCreate() println("C : " + s1.conf.get("key")) // new_value{code} The output is : {code:java} A : value B : new_value C : new_value <<<{code} But when we run the following (supposedly equivalent) code in Python: {code:java} from pyspark.sql import SparkSession s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() print("A : " + s1.conf.get("key")) s1.conf.set("key", "new_value") print("B : " + s1.conf.get("key")) s2 = SparkSession.builder.getOrCreate() print("C : " + s1.conf.get("key")){code} The output is : {code:java} A : value B : new_value C : value <<< {code} *Root cause analysis* This comes from the fact that _SparkSession.builder_ behaves differently in Python than in Scala. In Scala, it returns a *new builder* each time, in Python it returns *the same builder* every time, and the SparkSession.Builder._options are static, too. Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the options passed to the very first builder are re-applied every time, and overrides the option that were set afterwards. This leads to very awkward behavior in every Spark version up to 3.2.1 included {*}Example{*}: This example crashes, but was fixed by SPARK-37638 {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK from pyspark.sql import functions as f from pyspark.sql.types import StringType f.col("a").cast(StringType()) assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # This fails in all versions until the SPARK-37638 fix # because before that fix, Column.cast() calle SparkSession.builder.getOrCreate(){code} But this example still crashes in the current version on the master branch {code:java} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", "DYNAMIC").getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "DYNAMIC" # OK spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # OK SparkSession.builder.getOrCreate() assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" # This assert fails in master{code} I made a Pull Request to fix this bug : https://github.com/apache/spark/pull/36161 was: In pyspark, _SparkSession.builder_ always returns the same static builder, while the expected behaviour should be the same as in Scala, where it returns a new builder each time. 
*How to reproduce* When we run the following code in Scala : {code:java} import org.apache.spark.sql.SparkSession val s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() println("A : " + s1.conf.get("key")) // value s1.conf.set("key", "new_value") println("B : " + s1.conf.get("key")) // new_value val s2 = SparkSession.builder.getOrCreate() println("C : " + s1.conf.get("key")) // new_value{code} The output is : {code:java} A : value B : new_value C : new_value <<<{code} But when we run the following (supposedly equivalent) code in Python: {code:java} from pyspark.sql import SparkSession s1 = SparkSession.builder.master("local[2]").config("key", "value").getOrCreate() print("A : " + s1.conf.get("key")) s1.conf.set("key", "new_value") print("B : " + s1.conf.get("key")) s2 = SparkSession.builder.getOrCreate() print("C : " + s1.conf.get("key")){code} The output is : {code:java} A : value B : new_value C : value <<< {code} *Root cause analysis* This comes from the fact that _SparkSession.builder_ behaves differently in Python than in Scala. In Scala, it returns a *new builder* each time, in Python it returns *the same builder* every time, and the SparkSession.Builder._options are static, too. Because of this, whenever _SparkSession.builder.getOrCreate()_ is ca
[jira] [Commented] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python
[ https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521064#comment-17521064 ] Apache Spark commented on SPARK-38870: -- User 'FurcyPin' has created a pull request for this issue: https://github.com/apache/spark/pull/36161 > SparkSession.builder returns a new builder in Scala, but not in Python > -- > > Key: SPARK-38870 > URL: https://issues.apache.org/jira/browse/SPARK-38870 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.1 >Reporter: Furcy Pin >Priority: Major > > In pyspark, _SparkSession.builder_ always returns the same static builder, > while the expected behaviour should be the same as in Scala, where it returns > a new builder each time. > *How to reproduce* > When we run the following code in Scala : > {code:java} > import org.apache.spark.sql.SparkSession > val s1 = SparkSession.builder.master("local[2]").config("key", > "value").getOrCreate() > println("A : " + s1.conf.get("key")) // value > s1.conf.set("key", "new_value") > println("B : " + s1.conf.get("key")) // new_value > val s2 = SparkSession.builder.getOrCreate() > println("C : " + s1.conf.get("key")) // new_value{code} > The output is : > {code:java} > A : value > B : new_value > C : new_value <<<{code} > > But when we run the following (supposedly equivalent) code in Python: > {code:java} > from pyspark.sql import SparkSession > s1 = SparkSession.builder.master("local[2]").config("key", > "value").getOrCreate() > print("A : " + s1.conf.get("key")) > s1.conf.set("key", "new_value") > print("B : " + s1.conf.get("key")) > s2 = SparkSession.builder.getOrCreate() > print("C : " + s1.conf.get("key")){code} > The output is : > {code:java} > A : value > B : new_value > C : value <<< > {code} > > > *Root cause analysis* > This comes from the fact that _SparkSession.builder_ behaves differently in > Python than in Scala. In Scala, it returns a *new builder* each time, in > Python it returns *the same builder* every time, and the > SparkSession.Builder._options are static, too. > Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the > options passed to the very first builder are re-applied every time, and > overrides the option that were set afterwards. 
> This leads to very awkward behavior in every Spark version up to 3.2.1 > included > {*}Example{*}: > This example crashes, but was fixed by SPARK-37638 > > {code:java} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", > "DYNAMIC").getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == > "DYNAMIC" # OK > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # OK > from pyspark.sql import functions as f > from pyspark.sql.types import StringType > f.col("a").cast(StringType()) > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # This fails in all versions until the SPARK-37638 fix > # because before that fix, Column.cast() calle > SparkSession.builder.getOrCreate(){code} > > But this example still crashes in the current version on the master branch > {code:java} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", > "DYNAMIC").getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == > "DYNAMIC" # OK > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # OK > SparkSession.builder.getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # This assert fails in master{code} > > I made a Pull Request to fix this bug : > https://github.com/apache/spark/pull/36161 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python
[ https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38870: Assignee: Apache Spark > SparkSession.builder returns a new builder in Scala, but not in Python > -- > > Key: SPARK-38870 > URL: https://issues.apache.org/jira/browse/SPARK-38870 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.1 >Reporter: Furcy Pin >Assignee: Apache Spark >Priority: Major > > In pyspark, _SparkSession.builder_ always returns the same static builder, > while the expected behaviour should be the same as in Scala, where it returns > a new builder each time. > *How to reproduce* > When we run the following code in Scala : > {code:java} > import org.apache.spark.sql.SparkSession > val s1 = SparkSession.builder.master("local[2]").config("key", > "value").getOrCreate() > println("A : " + s1.conf.get("key")) // value > s1.conf.set("key", "new_value") > println("B : " + s1.conf.get("key")) // new_value > val s2 = SparkSession.builder.getOrCreate() > println("C : " + s1.conf.get("key")) // new_value{code} > The output is : > {code:java} > A : value > B : new_value > C : new_value <<<{code} > > But when we run the following (supposedly equivalent) code in Python: > {code:java} > from pyspark.sql import SparkSession > s1 = SparkSession.builder.master("local[2]").config("key", > "value").getOrCreate() > print("A : " + s1.conf.get("key")) > s1.conf.set("key", "new_value") > print("B : " + s1.conf.get("key")) > s2 = SparkSession.builder.getOrCreate() > print("C : " + s1.conf.get("key")){code} > The output is : > {code:java} > A : value > B : new_value > C : value <<< > {code} > > > *Root cause analysis* > This comes from the fact that _SparkSession.builder_ behaves differently in > Python than in Scala. In Scala, it returns a *new builder* each time, in > Python it returns *the same builder* every time, and the > SparkSession.Builder._options are static, too. > Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the > options passed to the very first builder are re-applied every time, and > overrides the option that were set afterwards. 
> This leads to very awkward behavior in every Spark version up to 3.2.1 > included > {*}Example{*}: > This example crashes, but was fixed by SPARK-37638 > > {code:java} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", > "DYNAMIC").getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == > "DYNAMIC" # OK > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # OK > from pyspark.sql import functions as f > from pyspark.sql.types import StringType > f.col("a").cast(StringType()) > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # This fails in all versions until the SPARK-37638 fix > # because before that fix, Column.cast() calle > SparkSession.builder.getOrCreate(){code} > > But this example still crashes in the current version on the master branch > {code:java} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", > "DYNAMIC").getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == > "DYNAMIC" # OK > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # OK > SparkSession.builder.getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # This assert fails in master{code} > > I made a Pull Request to fix this bug : > https://github.com/apache/spark/pull/36161 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38870) SparkSession.builder returns a new builder in Scala, but not in Python
[ https://issues.apache.org/jira/browse/SPARK-38870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38870: Assignee: (was: Apache Spark) > SparkSession.builder returns a new builder in Scala, but not in Python > -- > > Key: SPARK-38870 > URL: https://issues.apache.org/jira/browse/SPARK-38870 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.1 >Reporter: Furcy Pin >Priority: Major > > In pyspark, _SparkSession.builder_ always returns the same static builder, > while the expected behaviour should be the same as in Scala, where it returns > a new builder each time. > *How to reproduce* > When we run the following code in Scala : > {code:java} > import org.apache.spark.sql.SparkSession > val s1 = SparkSession.builder.master("local[2]").config("key", > "value").getOrCreate() > println("A : " + s1.conf.get("key")) // value > s1.conf.set("key", "new_value") > println("B : " + s1.conf.get("key")) // new_value > val s2 = SparkSession.builder.getOrCreate() > println("C : " + s1.conf.get("key")) // new_value{code} > The output is : > {code:java} > A : value > B : new_value > C : new_value <<<{code} > > But when we run the following (supposedly equivalent) code in Python: > {code:java} > from pyspark.sql import SparkSession > s1 = SparkSession.builder.master("local[2]").config("key", > "value").getOrCreate() > print("A : " + s1.conf.get("key")) > s1.conf.set("key", "new_value") > print("B : " + s1.conf.get("key")) > s2 = SparkSession.builder.getOrCreate() > print("C : " + s1.conf.get("key")){code} > The output is : > {code:java} > A : value > B : new_value > C : value <<< > {code} > > > *Root cause analysis* > This comes from the fact that _SparkSession.builder_ behaves differently in > Python than in Scala. In Scala, it returns a *new builder* each time, in > Python it returns *the same builder* every time, and the > SparkSession.Builder._options are static, too. > Because of this, whenever _SparkSession.builder.getOrCreate()_ is called, the > options passed to the very first builder are re-applied every time, and > overrides the option that were set afterwards. 
> This leads to very awkward behavior in every Spark version up to 3.2.1 > included > {*}Example{*}: > This example crashes, but was fixed by SPARK-37638 > > {code:java} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", > "DYNAMIC").getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == > "DYNAMIC" # OK > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # OK > from pyspark.sql import functions as f > from pyspark.sql.types import StringType > f.col("a").cast(StringType()) > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # This fails in all versions until the SPARK-37638 fix > # because before that fix, Column.cast() calle > SparkSession.builder.getOrCreate(){code} > > But this example still crashes in the current version on the master branch > {code:java} > from pyspark.sql import SparkSession > spark = > SparkSession.builder.config("spark.sql.sources.partitionOverwriteMode", > "DYNAMIC").getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == > "DYNAMIC" # OK > spark.conf.set("spark.sql.sources.partitionOverwriteMode", "STATIC") > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # OK > SparkSession.builder.getOrCreate() > assert spark.conf.get("spark.sql.sources.partitionOverwriteMode") == "STATIC" > # This assert fails in master{code} > > I made a Pull Request to fix this bug : > https://github.com/apache/spark/pull/36161 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32170) Improve the speculation for the inefficient tasks by the task metrics.
[ https://issues.apache.org/jira/browse/SPARK-32170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521084#comment-17521084 ] Apache Spark commented on SPARK-32170: -- User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/36162 > Improve the speculation for the inefficient tasks by the task metrics. > --- > > Key: SPARK-32170 > URL: https://issues.apache.org/jira/browse/SPARK-32170 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 3.0.0 >Reporter: weixiuli >Priority: Major > > 1) Tasks will be speculated when they meet certain conditions, no matter whether they are > inefficient or not; this is a huge waste of cluster resources. > 2) In production, a speculative copy launched for an efficient task will be > killed in the end, which is unnecessary and wastes cluster resources. > 3) So, we should first evaluate whether a task is inefficient based on the metrics of > successful tasks, and then decide whether to speculate it or not. Inefficient > tasks will be speculated and efficient ones will not, which is better for > cluster resources. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38854. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36145 [https://github.com/apache/spark/pull/36145] > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38589) New SQL function: try_avg
[ https://issues.apache.org/jira/browse/SPARK-38589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-38589. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35896 [https://github.com/apache/spark/pull/35896] > New SQL function: try_avg > - > > Key: SPARK-38589 > URL: https://issues.apache.org/jira/browse/SPARK-38589 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38873) CLONE - Improve the test coverage for pyspark/mllib module
pralabhkumar created SPARK-38873: Summary: CLONE - Improve the test coverage for pyspark/mllib module Key: SPARK-38873 URL: https://issues.apache.org/jira/browse/SPARK-38873 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, mllib module has 88% of test coverage. We could improve the test coverage by adding the missing tests for mllib module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
pralabhkumar created SPARK-38871: Summary: Improve the test coverage for PySpark/rddsampler.py Key: SPARK-38871 URL: https://issues.apache.org/jira/browse/SPARK-38871 Project: Spark Issue Type: Umbrella Components: PySpark, Tests Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, PySpark test coverage is around 91% according to the codecov report: [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] Since about 9% is still untested, I think it would be great to improve our test coverage. Of course we might not target 100%, but we should get as close as possible, to the level that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38877) CLONE - Improve the test coverage for pyspark/find_spark_home.py
pralabhkumar created SPARK-38877: Summary: CLONE - Improve the test coverage for pyspark/find_spark_home.py Key: SPARK-38877 URL: https://issues.apache.org/jira/browse/SPARK-38877 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 We should test when the environment variables are not set (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38875) CLONE - Improve the test coverage for pyspark/sql module
pralabhkumar created SPARK-38875: Summary: CLONE - Improve the test coverage for pyspark/sql module Key: SPARK-38875 URL: https://issues.apache.org/jira/browse/SPARK-38875 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, sql module has 90% of test coverage. We could improve the test coverage by adding the missing tests for sql module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38878) CLONE - Improve the test coverage for pyspark/statcounter.py
pralabhkumar created SPARK-38878: Summary: CLONE - Improve the test coverage for pyspark/statcounter.py Key: SPARK-38878 URL: https://issues.apache.org/jira/browse/SPARK-38878 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38876) CLONE - Improve the test coverage for pyspark/*.py
pralabhkumar created SPARK-38876: Summary: CLONE - Improve the test coverage for pyspark/*.py Key: SPARK-38876 URL: https://issues.apache.org/jira/browse/SPARK-38876 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, there are several Python scripts under pyspark/ directory. (e.g. rdd.py, util.py, serializers.py, ...) We could improve the test coverage by adding the missing tests for these scripts. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38872) CLONE - Improve the test coverage for pyspark/pandas module
pralabhkumar created SPARK-38872: Summary: CLONE - Improve the test coverage for pyspark/pandas module Key: SPARK-38872 URL: https://issues.apache.org/jira/browse/SPARK-38872 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, pandas module (pandas API on Spark) has 94% of test coverage. We could improve the test coverage by adding the missing tests for pandas module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521134#comment-17521134 ] pralabhkumar commented on SPARK-38871: -- Please close this one; it was wrongly cloned. > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to the codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since about 9% of the code is still missing tests, I think it would be great to > improve our test coverage. > Of course we might not target 100%, but we should get as close as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
pralabhkumar created SPARK-38879: Summary: Improve the test coverage for pyspark/rddsampler.py Key: SPARK-38879 URL: https://issues.apache.org/jira/browse/SPARK-38879 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Assignee: Hyukjin Kwon Fix For: 3.4.0 Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521135#comment-17521135 ] pralabhkumar edited comment on SPARK-38879 at 4/12/22 1:07 PM: --- Please allow me to work on this was (Author: pralabhkumar): I will be working on this . > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521135#comment-17521135 ] pralabhkumar commented on SPARK-38879: -- I will be working on this . > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar updated SPARK-38879: - Description: Improve the test coverage of rddsampler.py (was: Improve the test coverage of statcounter.py ) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar resolved SPARK-38871. -- Resolution: Invalid > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to the codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since about 9% of the code is still missing tests, I think it would be great to > improve our test coverage. > Of course we might not target 100%, but we should get as close as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-38871) Improve the test coverage for PySpark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pralabhkumar closed SPARK-38871. This issue was wrongly created, hence closing it. > Improve the test coverage for PySpark/rddsampler.py > --- > > Key: SPARK-38871 > URL: https://issues.apache.org/jira/browse/SPARK-38871 > Project: Spark > Issue Type: Umbrella > Components: PySpark, Tests >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, PySpark test coverage is around 91% according to the codecov report: > [https://app.codecov.io/gh/apache/spark|https://app.codecov.io/gh/apache/spark] > Since about 9% of the code is still missing tests, I think it would be great to > improve our test coverage. > Of course we might not target 100%, but we should get as close as possible, to the level > that we can currently cover with CI. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38874) CLONE - Improve the test coverage for pyspark/ml module
pralabhkumar created SPARK-38874: Summary: CLONE - Improve the test coverage for pyspark/ml module Key: SPARK-38874 URL: https://issues.apache.org/jira/browse/SPARK-38874 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.3.0 Reporter: pralabhkumar Currently, ml module has 90% of test coverage. We could improve the test coverage by adding the missing tests for ml module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38848) Replace all `@Test(expected = XXException)` with assertThrows
[ https://issues.apache.org/jira/browse/SPARK-38848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38848: Assignee: Yang Jie > Replace all `@Test(expected = XXException)` with assertThrows > -- > > Key: SPARK-38848 > URL: https://issues.apache.org/jira/browse/SPARK-38848 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > {{@Test}} no longer has the {{expected}} parameter in JUnit 5; use {{assertThrows}} > instead -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38848) Replace all `@Test(expected = XXException)` with assertThrows
[ https://issues.apache.org/jira/browse/SPARK-38848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38848. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36133 [https://github.com/apache/spark/pull/36133] > Replace all `@Test(expected = XXException)` with assertThrows > -- > > Key: SPARK-38848 > URL: https://issues.apache.org/jira/browse/SPARK-38848 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > {{@Test}} no longer has the {{expected}} parameter in JUnit 5; use {{assertThrows}} > instead -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38847) Introduce a `viewToSeq` function for `KVUtils`
[ https://issues.apache.org/jira/browse/SPARK-38847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38847: Assignee: Yang Jie > Introduce a `viewToSeq` function for `KVUtils` > -- > > Key: SPARK-38847 > URL: https://issues.apache.org/jira/browse/SPARK-38847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > There is a lot of code in Spark that converts a KVStoreView into a `List`, and this > code does not close the `KVStoreIterator`; these resources are mainly reclaimed > by the `finalize()` method implemented in `LevelDB` and `RocksDB`, which makes > `KVStoreIterator` resource recycling unpredictable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38847) Introduce a `viewToSeq` function for `KVUtils`
[ https://issues.apache.org/jira/browse/SPARK-38847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38847. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36132 [https://github.com/apache/spark/pull/36132] > Introduce a `viewToSeq` function for `KVUtils` > -- > > Key: SPARK-38847 > URL: https://issues.apache.org/jira/browse/SPARK-38847 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > There is a lot of code in Spark that converts a KVStoreView into a `List`, and this > code does not close the `KVStoreIterator`; these resources are mainly reclaimed > by the `finalize()` method implemented in `LevelDB` and `RocksDB`, which makes > `KVStoreIterator` resource recycling unpredictable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38689) Use error classes in the compilation errors of not allowed DESC PARTITION
[ https://issues.apache.org/jira/browse/SPARK-38689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521227#comment-17521227 ] Apache Spark commented on SPARK-38689: -- User 'ivoson' has created a pull request for this issue: https://github.com/apache/spark/pull/36163 > Use error classes in the compilation errors of not allowed DESC PARTITION > - > > Key: SPARK-38689 > URL: https://issues.apache.org/jira/browse/SPARK-38689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * descPartitionNotAllowedOnTempView > * descPartitionNotAllowedOnView > * descPartitionNotAllowedOnViewError > onto error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38689) Use error classes in the compilation errors of not allowed DESC PARTITION
[ https://issues.apache.org/jira/browse/SPARK-38689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38689: Assignee: Apache Spark > Use error classes in the compilation errors of not allowed DESC PARTITION > - > > Key: SPARK-38689 > URL: https://issues.apache.org/jira/browse/SPARK-38689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * descPartitionNotAllowedOnTempView > * descPartitionNotAllowedOnView > * descPartitionNotAllowedOnViewError > onto error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38689) Use error classes in the compilation errors of not allowed DESC PARTITION
[ https://issues.apache.org/jira/browse/SPARK-38689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38689: Assignee: (was: Apache Spark) > Use error classes in the compilation errors of not allowed DESC PARTITION > - > > Key: SPARK-38689 > URL: https://issues.apache.org/jira/browse/SPARK-38689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * descPartitionNotAllowedOnTempView > * descPartitionNotAllowedOnView > * descPartitionNotAllowedOnViewError > onto error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
Xinrong Meng created SPARK-38880: Summary: Implement `numeric_only` parameter of `GroupBy.max/min` Key: SPARK-38880 URL: https://issues.apache.org/jira/browse/SPARK-38880 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
[ https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38880: Assignee: (was: Apache Spark) > Implement `numeric_only` parameter of `GroupBy.max/min` > --- > > Key: SPARK-38880 > URL: https://issues.apache.org/jira/browse/SPARK-38880 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
[ https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38880: Assignee: Apache Spark > Implement `numeric_only` parameter of `GroupBy.max/min` > --- > > Key: SPARK-38880 > URL: https://issues.apache.org/jira/browse/SPARK-38880 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
[ https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521244#comment-17521244 ] Apache Spark commented on SPARK-38880: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36148 > Implement `numeric_only` parameter of `GroupBy.max/min` > --- > > Key: SPARK-38880 > URL: https://issues.apache.org/jira/browse/SPARK-38880 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38880) Implement `numeric_only` parameter of `GroupBy.max/min`
[ https://issues.apache.org/jira/browse/SPARK-38880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521245#comment-17521245 ] Apache Spark commented on SPARK-38880: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/36148 > Implement `numeric_only` parameter of `GroupBy.max/min` > --- > > Key: SPARK-38880 > URL: https://issues.apache.org/jira/browse/SPARK-38880 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter of `GroupBy.max/min` -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
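To make the proposal concrete, here is a minimal pandas-on-Spark sketch of how the parameter would be used, assuming `numeric_only` follows the pandas `GroupBy.max`/`GroupBy.min` signature; the data and column names are purely illustrative, and the keyword itself is what this ticket proposes to add.
{code:python}
# Minimal sketch, assuming `numeric_only` mirrors the pandas GroupBy signature;
# illustrative usage only, not the merged implementation.
import pyspark.pandas as ps

psdf = ps.DataFrame({
    "key": ["a", "a", "b"],
    "num": [1, 2, 3],
    "text": ["x", "y", "z"],
})

# With numeric_only=True, non-numeric columns such as `text` would be dropped
# from the aggregation result instead of causing an error.
print(psdf.groupby("key").max(numeric_only=True))
print(psdf.groupby("key").min(numeric_only=True))
{code}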
[jira] [Commented] (SPARK-36620) Client side related push-based shuffle metrics
[ https://issues.apache.org/jira/browse/SPARK-36620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521276#comment-17521276 ] Apache Spark commented on SPARK-36620: -- User 'thejdeep' has created a pull request for this issue: https://github.com/apache/spark/pull/36165 > Client side related push-based shuffle metrics > -- > > Key: SPARK-36620 > URL: https://issues.apache.org/jira/browse/SPARK-36620 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Thejdeep Gudivada >Priority: Major > > Need to add client side related metrics to push-based shuffle. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36620) Client side related push-based shuffle metrics
[ https://issues.apache.org/jira/browse/SPARK-36620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521277#comment-17521277 ] Apache Spark commented on SPARK-36620: -- User 'thejdeep' has created a pull request for this issue: https://github.com/apache/spark/pull/36165 > Client side related push-based shuffle metrics > -- > > Key: SPARK-36620 > URL: https://issues.apache.org/jira/browse/SPARK-36620 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: Thejdeep Gudivada >Priority: Major > > Need to add client side related metrics to push-based shuffle. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs
Mark Khaitman created SPARK-38881: - Summary: PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs Key: SPARK-38881 URL: https://issues.apache.org/jira/browse/SPARK-38881 Project: Spark Issue Type: Improvement Components: DStreams, Input/Output, PySpark Affects Versions: 3.2.1 Reporter: Mark Khaitman This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was merged as part of Spark 3.0.0 This change is desirable as it further exposes the metricsLevel config parameter that was added for the Scala/Java Spark APIs when working with the Kinesis Streaming integration, and makes it available to the PySpark API as well. This change passes all tests, and local testing was done with a development Kinesis stream in AWS, in order to confirm that metrics were no longer being reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark Kinesis streaming context creation, and also worked as it does today when leaving the MetricsLevel parameter out, which would result in a default of DETAILED, with CloudWatch metrics appearing again. I plan to open the PR from my forked repo shortly for further discussion if required. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs
[ https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38881: Assignee: (was: Apache Spark) > PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that > is already supported in the Scala/Java APIs > --- > > Key: SPARK-38881 > URL: https://issues.apache.org/jira/browse/SPARK-38881 > Project: Spark > Issue Type: Improvement > Components: DStreams, Input/Output, PySpark >Affects Versions: 3.2.1 >Reporter: Mark Khaitman >Priority: Major > > This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was > merged as part of Spark 3.0.0 > This change is desirable as it further exposes the metricsLevel config > parameter that was added for the Scala/Java Spark APIs when working with the > Kinesis Streaming integration, and makes it available to the PySpark API as > well. > This change passes all tests, and local testing was done with a development > Kinesis stream in AWS, in order to confirm that metrics were no longer being > reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark > Kinesis streaming context creation, and also worked as it does today when > leaving the MetricsLevel parameter out, which would result in a default of > DETAILED, with CloudWatch metrics appearing again. > https://github.com/apache/spark/pull/36166 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs
[ https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38881: Assignee: Apache Spark > PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that > is already supported in the Scala/Java APIs > --- > > Key: SPARK-38881 > URL: https://issues.apache.org/jira/browse/SPARK-38881 > Project: Spark > Issue Type: Improvement > Components: DStreams, Input/Output, PySpark >Affects Versions: 3.2.1 >Reporter: Mark Khaitman >Assignee: Apache Spark >Priority: Major > > This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was > merged as part of Spark 3.0.0 > This change is desirable as it further exposes the metricsLevel config > parameter that was added for the Scala/Java Spark APIs when working with the > Kinesis Streaming integration, and makes it available to the PySpark API as > well. > This change passes all tests, and local testing was done with a development > Kinesis stream in AWS, in order to confirm that metrics were no longer being > reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark > Kinesis streaming context creation, and also worked as it does today when > leaving the MetricsLevel parameter out, which would result in a default of > DETAILED, with CloudWatch metrics appearing again. > https://github.com/apache/spark/pull/36166 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs
[ https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Khaitman updated SPARK-38881: -- Description: This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was merged as part of Spark 3.0.0 This change is desirable as it further exposes the metricsLevel config parameter that was added for the Scala/Java Spark APIs when working with the Kinesis Streaming integration, and makes it available to the PySpark API as well. This change passes all tests, and local testing was done with a development Kinesis stream in AWS, in order to confirm that metrics were no longer being reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark Kinesis streaming context creation, and also worked as it does today when leaving the MetricsLevel parameter out, which would result in a default of DETAILED, with CloudWatch metrics appearing again. https://github.com/apache/spark/pull/36166 was: This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was merged as part of Spark 3.0.0 This change is desirable as it further exposes the metricsLevel config parameter that was added for the Scala/Java Spark APIs when working with the Kinesis Streaming integration, and makes it available to the PySpark API as well. This change passes all tests, and local testing was done with a development Kinesis stream in AWS, in order to confirm that metrics were no longer being reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark Kinesis streaming context creation, and also worked as it does today when leaving the MetricsLevel parameter out, which would result in a default of DETAILED, with CloudWatch metrics appearing again. I plan to open the PR from my forked repo shortly for further discussion if required. > PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that > is already supported in the Scala/Java APIs > --- > > Key: SPARK-38881 > URL: https://issues.apache.org/jira/browse/SPARK-38881 > Project: Spark > Issue Type: Improvement > Components: DStreams, Input/Output, PySpark >Affects Versions: 3.2.1 >Reporter: Mark Khaitman >Priority: Major > > This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was > merged as part of Spark 3.0.0 > This change is desirable as it further exposes the metricsLevel config > parameter that was added for the Scala/Java Spark APIs when working with the > Kinesis Streaming integration, and makes it available to the PySpark API as > well. > This change passes all tests, and local testing was done with a development > Kinesis stream in AWS, in order to confirm that metrics were no longer being > reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark > Kinesis streaming context creation, and also worked as it does today when > leaving the MetricsLevel parameter out, which would result in a default of > DETAILED, with CloudWatch metrics appearing again. > https://github.com/apache/spark/pull/36166 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38881) PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that is already supported in the Scala/Java APIs
[ https://issues.apache.org/jira/browse/SPARK-38881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521293#comment-17521293 ] Apache Spark commented on SPARK-38881: -- User 'mkman84' has created a pull request for this issue: https://github.com/apache/spark/pull/36166 > PySpark Kinesis Streaming should expose metricsLevel CloudWatch config that > is already supported in the Scala/Java APIs > --- > > Key: SPARK-38881 > URL: https://issues.apache.org/jira/browse/SPARK-38881 > Project: Spark > Issue Type: Improvement > Components: DStreams, Input/Output, PySpark >Affects Versions: 3.2.1 >Reporter: Mark Khaitman >Priority: Major > > This relates to https://issues.apache.org/jira/browse/SPARK-27420 which was > merged as part of Spark 3.0.0 > This change is desirable as it further exposes the metricsLevel config > parameter that was added for the Scala/Java Spark APIs when working with the > Kinesis Streaming integration, and makes it available to the PySpark API as > well. > This change passes all tests, and local testing was done with a development > Kinesis stream in AWS, in order to confirm that metrics were no longer being > reported to CloudWatch after specifying MetricsLevel.NONE in the PySpark > Kinesis streaming context creation, and also worked as it does today when > leaving the MetricsLevel parameter out, which would result in a default of > DETAILED, with CloudWatch metrics appearing again. > https://github.com/apache/spark/pull/36166 > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
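As a rough sketch of what the PySpark call could look like once the option is exposed: the `metricsLevel` keyword and the `MetricsLevel` import below are assumptions mirroring the Scala/Java API from SPARK-27420 (they are exactly what this ticket proposes), the stream name, endpoint, and region are placeholders, and the rest of the `KinesisUtils.createStream` call is the existing PySpark API.
{code:python}
# Hedged sketch: `MetricsLevel` and the `metricsLevel` keyword are the proposed
# additions and are assumed here; stream name/endpoint/region are placeholders.
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import (
    KinesisUtils, InitialPositionInStream, MetricsLevel)  # MetricsLevel: assumed export

sc = SparkContext(appName="KinesisMetricsLevelDemo")
ssc = StreamingContext(sc, 10)  # 10-second batches

stream = KinesisUtils.createStream(
    ssc, "KinesisMetricsLevelDemo", "my-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, 10,
    StorageLevel.MEMORY_AND_DISK_2,
    metricsLevel=MetricsLevel.NONE,  # assumed keyword; would stop CloudWatch reporting
)

# Omitting metricsLevel would keep today's behaviour (DETAILED metrics in CloudWatch).
{code}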
[jira] [Assigned] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
[ https://issues.apache.org/jira/browse/SPARK-38767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38767: - Assignee: Yaohua Cui > Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options > > > Key: SPARK-38767 > URL: https://issues.apache.org/jira/browse/SPARK-38767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yaohua Zhao >Assignee: Yaohua Cui >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
[ https://issues.apache.org/jira/browse/SPARK-38767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38767. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36069 [https://github.com/apache/spark/pull/36069] > Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options > > > Key: SPARK-38767 > URL: https://issues.apache.org/jira/browse/SPARK-38767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yaohua Zhao >Assignee: Yaohua Cui >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38767) Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options
[ https://issues.apache.org/jira/browse/SPARK-38767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38767: - Assignee: Yaohua Zhao (was: Yaohua Cui) > Support ignoreCorruptFiles and ignoreMissingFiles in Data Source options > > > Key: SPARK-38767 > URL: https://issues.apache.org/jira/browse/SPARK-38767 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
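For illustration, a short sketch of how the new per-read options would sit next to the pre-existing session-wide confs; the option spellings are assumed to mirror the `spark.sql.files.ignoreCorruptFiles` / `spark.sql.files.ignoreMissingFiles` settings this ticket builds on, and the input path is a placeholder.
{code:python}
# Sketch of the per-source options tracked here, shown next to the older
# session-wide confs; option names are assumed to mirror the spark.sql.files.*
# settings, and the input path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignore-files-demo").getOrCreate()

# Pre-existing, session-wide behaviour: applies to every file-based read.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

# Per-read data source options (the scope of this ticket): only this read
# tolerates corrupt or missing files; other reads keep their own settings.
df = (spark.read
      .option("ignoreCorruptFiles", "true")
      .option("ignoreMissingFiles", "true")
      .parquet("/data/events/"))  # placeholder path
print(df.count())
{code}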
[jira] [Commented] (SPARK-38792) Regression in time executor takes to do work sometime after v3.0.1 ?
[ https://issues.apache.org/jira/browse/SPARK-38792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521342#comment-17521342 ] Danny Guinther commented on SPARK-38792: Where does org.apache.spark.sql.execution.collect.Collector live? I can't find it, and New Relic suggests that the problem may stem from some classes in org.apache.spark.sql.execution.collect.*. See the attached screenshot named what-is-this-code.jpg > Regression in time executor takes to do work sometime after v3.0.1 ? > > > Key: SPARK-38792 > URL: https://issues.apache.org/jira/browse/SPARK-38792 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Danny Guinther >Priority: Major > Attachments: dummy-job-job.jpg, dummy-job-query.png, > executor-timing-debug-number-2.jpg, executor-timing-debug-number-4.jpg, > executor-timing-debug-number-5.jpg, min-time-way-up.jpg, > what-s-up-with-exec-actions.jpg > > > Hello! > I'm sorry to trouble you with this, but I'm seeing a noticeable regression in > performance when upgrading from 3.0.1 to 3.2.1 and I can't pin down why. I > don't believe it is specific to my application since the upgrade from 3.0.1 to > 3.2.1 is purely a configuration change. I'd guess it presents itself in my > application due to the high volume of work my application does, but I could > be mistaken. > The gist is that it seems like the executor actions I'm running suddenly > appear to take a lot longer on Spark 3.2.1. I don't have any ability to test > versions between 3.0.1 and 3.2.1 because my application was previously > blocked from upgrading beyond Spark 3.0.1 by > https://issues.apache.org/jira/browse/SPARK-37391 (which I helped to fix). > Any ideas what might cause this or metrics I might try to gather to pinpoint > the problem? I've tried a bunch of the suggestions from > [https://spark.apache.org/docs/latest/tuning.html] to see if any of those > help, but none of the adjustments I've tried have been fruitful. I also tried > to look in [https://spark.apache.org/docs/latest/sql-migration-guide.html] > for ideas as to what might have changed to cause this behavior, but haven't > seen anything that sticks out as being a possible source of the problem. > I have attached a graph that shows the drastic change in time taken by > executor actions. In the image the blue and purple lines are different kinds > of reads using the built-in JDBC data reader and the green line is writes > using a custom-built data writer. The deploy to switch from 3.0.1 to 3.2.1 > occurred at 9AM on the graph. The graph data comes from timing blocks that > surround only the calls to dataframe actions, so there shouldn't be anything > specific to my application that is suddenly inflating these numbers. The > specific actions I'm invoking are: count() (but there's some transforming and > caching going on, so it's really more than that); first(); and write(). > The driver process does seem to be seeing more GC churn than with Spark > 3.0.1, but I don't think that explains this behavior. The executors don't > seem to have any problem with memory or GC and are not overutilized (our > pipeline is very read and write heavy, less heavy on transformations, so > executors tend to be idle while waiting for various network I/O). > > Thanks in advance for any help! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38792) Regression in time executor takes to do work sometime after v3.0.1 ?
[ https://issues.apache.org/jira/browse/SPARK-38792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Guinther updated SPARK-38792: --- Attachment: what-is-this-code.jpg > Regression in time executor takes to do work sometime after v3.0.1 ? > > > Key: SPARK-38792 > URL: https://issues.apache.org/jira/browse/SPARK-38792 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Danny Guinther >Priority: Major > Attachments: dummy-job-job.jpg, dummy-job-query.png, > executor-timing-debug-number-2.jpg, executor-timing-debug-number-4.jpg, > executor-timing-debug-number-5.jpg, min-time-way-up.jpg, > what-is-this-code.jpg, what-s-up-with-exec-actions.jpg > > > Hello! > I'm sorry to trouble you with this, but I'm seeing a noticeable regression in > performance when upgrading from 3.0.1 to 3.2.1 and I can't pin down why. I > don't believe it is specific to my application since the upgrade from 3.0.1 to > 3.2.1 is purely a configuration change. I'd guess it presents itself in my > application due to the high volume of work my application does, but I could > be mistaken. > The gist is that it seems like the executor actions I'm running suddenly > appear to take a lot longer on Spark 3.2.1. I don't have any ability to test > versions between 3.0.1 and 3.2.1 because my application was previously > blocked from upgrading beyond Spark 3.0.1 by > https://issues.apache.org/jira/browse/SPARK-37391 (which I helped to fix). > Any ideas what might cause this or metrics I might try to gather to pinpoint > the problem? I've tried a bunch of the suggestions from > [https://spark.apache.org/docs/latest/tuning.html] to see if any of those > help, but none of the adjustments I've tried have been fruitful. I also tried > to look in [https://spark.apache.org/docs/latest/sql-migration-guide.html] > for ideas as to what might have changed to cause this behavior, but haven't > seen anything that sticks out as being a possible source of the problem. > I have attached a graph that shows the drastic change in time taken by > executor actions. In the image the blue and purple lines are different kinds > of reads using the built-in JDBC data reader and the green line is writes > using a custom-built data writer. The deploy to switch from 3.0.1 to 3.2.1 > occurred at 9AM on the graph. The graph data comes from timing blocks that > surround only the calls to dataframe actions, so there shouldn't be anything > specific to my application that is suddenly inflating these numbers. The > specific actions I'm invoking are: count() (but there's some transforming and > caching going on, so it's really more than that); first(); and write(). > The driver process does seem to be seeing more GC churn than with Spark > 3.0.1, but I don't think that explains this behavior. The executors don't > seem to have any problem with memory or GC and are not overutilized (our > pipeline is very read and write heavy, less heavy on transformations, so > executors tend to be idle while waiting for various network I/O). > > Thanks in advance for any help! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
Takuya Ueshin created SPARK-38882: - Summary: The usage logger attachment logic should handle static methods properly. Key: SPARK-38882 URL: https://issues.apache.org/jira/browse/SPARK-38882 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.1, 3.3.0 Reporter: Takuya Ueshin The usage logger attachment logic has an issue when handling static methods. For example, {code} $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger ./bin/pyspark {code} {code:python} >>> import pyspark.pandas as ps >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) >>> psdf.from_records([(1, 2), (3, 4)]) A function `DataFrame.from_records(data, index, exclude, columns, coerce_float, nrows)` was failed after 2007.430 ms: 0 Traceback (most recent call last): ... {code} without usage logger: {code:python} >>> import pyspark.pandas as ps >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) >>> psdf.from_records([(1, 2), (3, 4)]) 0 1 0 1 2 1 3 4 {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
[ https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521355#comment-17521355 ] Apache Spark commented on SPARK-38882: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/36167 > The usage logger attachment logic should handle static methods properly. > > > Key: SPARK-38882 > URL: https://issues.apache.org/jira/browse/SPARK-38882 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > The usage logger attachment logic has an issue when handling static methods. > For example, > {code} > $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger > ./bin/pyspark > {code} > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) > A function `DataFrame.from_records(data, index, exclude, columns, > coerce_float, nrows)` was failed after 2007.430 ms: 0 > Traceback (most recent call last): > ... > {code} > without usage logger: > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) >0 1 > 0 1 2 > 1 3 4 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
[ https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38882: Assignee: (was: Apache Spark) > The usage logger attachment logic should handle static methods properly. > > > Key: SPARK-38882 > URL: https://issues.apache.org/jira/browse/SPARK-38882 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > The usage logger attachment logic has an issue when handling static methods. > For example, > {code} > $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger > ./bin/pyspark > {code} > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) > A function `DataFrame.from_records(data, index, exclude, columns, > coerce_float, nrows)` was failed after 2007.430 ms: 0 > Traceback (most recent call last): > ... > {code} > without usage logger: > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) >0 1 > 0 1 2 > 1 3 4 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
[ https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38882: Assignee: Apache Spark > The usage logger attachment logic should handle static methods properly. > > > Key: SPARK-38882 > URL: https://issues.apache.org/jira/browse/SPARK-38882 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > The usage logger attachment logic has an issue when handling static methods. > For example, > {code} > $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger > ./bin/pyspark > {code} > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) > A function `DataFrame.from_records(data, index, exclude, columns, > coerce_float, nrows)` was failed after 2007.430 ms: 0 > Traceback (most recent call last): > ... > {code} > without usage logger: > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) >0 1 > 0 1 2 > 1 3 4 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38852) Better Data Source V2 operator pushdown framework
[ https://issues.apache.org/jira/browse/SPARK-38852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521364#comment-17521364 ] Erik Krogen commented on SPARK-38852: - What's the relationship between this and SPARK-38788? Seems like they are laying out the same goal? > Better Data Source V2 operator pushdown framework > - > > Key: SPARK-38852 > URL: https://issues.apache.org/jira/browse/SPARK-38852 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark supports pushing Filters and Aggregates down to the data source. > However, the Data Source V2 operator pushdown framework has the following > shortcomings: > # Only simple filters and aggregates are supported, which makes it impossible > to apply in most scenarios > # The incompatibility of SQL syntax makes it impossible to apply in most > scenarios > # Aggregate push down does not support multiple partitions of data sources > # Spark's additional aggregate will cause some overhead > # Limit push down is not supported > # Top n push down is not supported > # Offset push down is not supported -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
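To make the listed gaps concrete, here is a small PySpark/JDBC sketch of a query whose filter, aggregate, and LIMIT would all benefit from running inside the source database instead of in Spark; the connection URL, table, and column names are placeholders, and the `pushDown*` options are JDBC source knobs whose availability depends on the Spark version in use.
{code:python}
# Illustration only: a JDBC query where filter, aggregate and limit pushdown all
# matter; URL, table and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/shop")  # placeholder
          .option("dbtable", "orders")                           # placeholder
          .option("pushDownAggregate", "true")  # availability depends on Spark version
          .option("pushDownLimit", "true")      # availability depends on Spark version
          .load())

top_customers = (orders
                 .where(F.col("status") == "PAID")
                 .groupBy("customer_id")
                 .agg(F.sum("amount").alias("total"))
                 .limit(10))

# The physical plan shows which operators were pushed to the source
# (e.g. PushedFilters / PushedAggregates in the scan node) and which still run in Spark.
top_customers.explain()
{code}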
[jira] [Commented] (SPARK-38788) More comprehensive DSV2 push down capabilities
[ https://issues.apache.org/jira/browse/SPARK-38788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521363#comment-17521363 ] Erik Krogen commented on SPARK-38788: - What's the relationship between this and SPARK-38852? Seems like they are laying out the same goal? > More comprehensive DSV2 push down capabilities > -- > > Key: SPARK-38788 > URL: https://issues.apache.org/jira/browse/SPARK-38788 > Project: Spark > Issue Type: Epic > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Get together all tickets related to push down (filters) via Datasource V2 > APIs. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
[ https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38882. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/36167 > The usage logger attachment logic should handle static methods properly. > > > Key: SPARK-38882 > URL: https://issues.apache.org/jira/browse/SPARK-38882 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > > The usage logger attachment logic has an issue when handling static methods. > For example, > {code} > $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger > ./bin/pyspark > {code} > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) > A function `DataFrame.from_records(data, index, exclude, columns, > coerce_float, nrows)` was failed after 2007.430 ms: 0 > Traceback (most recent call last): > ... > {code} > without usage logger: > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) >0 1 > 0 1 2 > 1 3 4 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38882) The usage logger attachment logic should handle static methods properly.
[ https://issues.apache.org/jira/browse/SPARK-38882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38882: - Fix Version/s: 3.3.0 > The usage logger attachment logic should handle static methods properly. > > > Key: SPARK-38882 > URL: https://issues.apache.org/jira/browse/SPARK-38882 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Takuya Ueshin >Priority: Major > Fix For: 3.3.0 > > > The usage logger attachment logic has an issue when handling static methods. > For example, > {code} > $ PYSPARK_PANDAS_USAGE_LOGGER=pyspark.pandas.usage_logging.usage_logger > ./bin/pyspark > {code} > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) > A function `DataFrame.from_records(data, index, exclude, columns, > coerce_float, nrows)` was failed after 2007.430 ms: 0 > Traceback (most recent call last): > ... > {code} > without usage logger: > {code:python} > >>> import pyspark.pandas as ps > >>> psdf = ps.DataFrame({"a": [1,2,3], "b": [4,5,6]}) > >>> psdf.from_records([(1, 2), (3, 4)]) >0 1 > 0 1 2 > 1 3 4 > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38878) CLONE - Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38878. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38878 > URL: https://issues.apache.org/jira/browse/SPARK-38878 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38876) CLONE - Improve the test coverage for pyspark/*.py
[ https://issues.apache.org/jira/browse/SPARK-38876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38876. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/*.py > -- > > Key: SPARK-38876 > URL: https://issues.apache.org/jira/browse/SPARK-38876 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, there are several Python scripts under pyspark/ directory. (e.g. > rdd.py, util.py, serializers.py, ...) > We could improve the test coverage by adding the missing tests for these > scripts. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38877) CLONE - Improve the test coverage for pyspark/find_spark_home.py
[ https://issues.apache.org/jira/browse/SPARK-38877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38877. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/find_spark_home.py > > > Key: SPARK-38877 > URL: https://issues.apache.org/jira/browse/SPARK-38877 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > We should test when the environment variables are not set > (https://app.codecov.io/gh/apache/spark/blob/master/python/pyspark/find_spark_home.py) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38875) CLONE - Improve the test coverage for pyspark/sql module
[ https://issues.apache.org/jira/browse/SPARK-38875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38875. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/sql module > > > Key: SPARK-38875 > URL: https://issues.apache.org/jira/browse/SPARK-38875 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, sql module has 90% of test coverage. > We could improve the test coverage by adding the missing tests for sql module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38872) CLONE - Improve the test coverage for pyspark/pandas module
[ https://issues.apache.org/jira/browse/SPARK-38872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38872. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/pandas module > --- > > Key: SPARK-38872 > URL: https://issues.apache.org/jira/browse/SPARK-38872 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, pandas module (pandas API on Spark) has 94% of test coverage. > We could improve the test coverage by adding the missing tests for pandas > module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38873) CLONE - Improve the test coverage for pyspark/mllib module
[ https://issues.apache.org/jira/browse/SPARK-38873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38873. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/mllib module > -- > > Key: SPARK-38873 > URL: https://issues.apache.org/jira/browse/SPARK-38873 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, mllib module has 88% of test coverage. > We could improve the test coverage by adding the missing tests for mllib > module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38874) CLONE - Improve the test coverage for pyspark/ml module
[ https://issues.apache.org/jira/browse/SPARK-38874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38874. -- Resolution: Invalid > CLONE - Improve the test coverage for pyspark/ml module > --- > > Key: SPARK-38874 > URL: https://issues.apache.org/jira/browse/SPARK-38874 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Major > > Currently, ml module has 90% of test coverage. > We could improve the test coverage by adding the missing tests for ml module. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38879: Assignee: (was: Hyukjin Kwon) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521398#comment-17521398 ] Hyukjin Kwon commented on SPARK-38879: -- [~pralabhkumar] please just go ahead. no need to ask :-). > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38854) Improve the test coverage for pyspark/statcounter.py
[ https://issues.apache.org/jira/browse/SPARK-38854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38854: Assignee: pralabhkumar (was: Hyukjin Kwon) > Improve the test coverage for pyspark/statcounter.py > > > Key: SPARK-38854 > URL: https://issues.apache.org/jira/browse/SPARK-38854 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Assignee: pralabhkumar >Priority: Minor > Fix For: 3.4.0 > > > Improve the test coverage of statcounter.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38879) Improve the test coverage for pyspark/rddsampler.py
[ https://issues.apache.org/jira/browse/SPARK-38879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38879: - Fix Version/s: (was: 3.4.0) > Improve the test coverage for pyspark/rddsampler.py > --- > > Key: SPARK-38879 > URL: https://issues.apache.org/jira/browse/SPARK-38879 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.3.0 >Reporter: pralabhkumar >Priority: Minor > > Improve the test coverage of rddsampler.py -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38822) Raise IndexError when insert loc is out of bounds
[ https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38822: Assignee: Yikun Jiang > Raise IndexError when insert loc is out of bounds > - > > Key: SPARK-38822 > URL: https://issues.apache.org/jira/browse/SPARK-38822 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > > [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179] > > we need to raise IndexError when the loc is out of bounds for the axis, and also change the test > case > > - Related changes: > - pandas 1.4.0+ uses numpy insert: > https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875 > - Since numpy 1.8 (10 years ago > https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc) > : [`out-of-bound indices will generate an > error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38822) Raise IndexError when insert loc is out of bounds
[ https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38822. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36115 [https://github.com/apache/spark/pull/36115] > Raise IndexError when insert loc is out of bounds > - > > Key: SPARK-38822 > URL: https://issues.apache.org/jira/browse/SPARK-38822 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > > [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179] > > we need to raise IndexError when the loc is out of bounds for the axis, and also change the test > case > > - Related changes: > - pandas 1.4.0+ uses numpy insert: > https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875 > - Since numpy 1.8 (10 years ago > https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc) > : [`out-of-bound indices will generate an > error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38822) Raise IndexError when insert loc is out of bounds
[ https://issues.apache.org/jira/browse/SPARK-38822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521411#comment-17521411 ] Apache Spark commented on SPARK-38822: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/36168 > Raise IndexError when insert loc is out of bounds > - > > Key: SPARK-38822 > URL: https://issues.apache.org/jira/browse/SPARK-38822 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > > [https://github.com/apache/spark/blob/becda3339381b3975ed567c156260eda036d7a1b/python/pyspark/pandas/tests/indexes/test_base.py#L2179] > > we need to raise IndexError when the loc is out of bounds for the axis, and also change the test > case > > - Related changes: > - pandas 1.4.0+ uses numpy insert: > https://github.com/pandas-dev/pandas/commit/c021d33ecf0e096a186edb731964767e9288a875 > - Since numpy 1.8 (10 years ago > https://github.com/numpy/numpy/commit/908e06c3c465434023649b0ca522836580c5cfdc) > : [`out-of-bound indices will generate an > error.`](https://numpy.org/devdocs/release/1.8.0-notes.html#changes) > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
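For reference, a small sketch of the behavior this ticket asks for (assumes an active SparkSession): pandas 1.4+ delegates Index.insert to numpy.insert, which rejects out-of-bounds locations, and pandas-on-Spark should now match that instead of accepting the loc unchecked.
{code:python}
import pandas as pd
import pyspark.pandas as ps

pidx = pd.Index([1, 2, 3])
psidx = ps.Index([1, 2, 3])

# pandas 1.4+ raises IndexError (via numpy), e.g.
# "index 100 is out of bounds for axis 0 with size 3".
try:
    pidx.insert(100, 4)
except IndexError as exc:
    print("pandas:", exc)

# After SPARK-38822, the same out-of-bounds loc raises IndexError here too;
# before the fix the location was not validated.
try:
    psidx.insert(100, 4)
except IndexError as exc:
    print("pandas-on-Spark:", exc)
{code}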
[jira] [Commented] (SPARK-38823) Incorrect result of dataset reduceGroups in Java
[ https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521414#comment-17521414 ] Bruce Robbins commented on SPARK-38823: --- This appears to be an optimization bug that results in corruption of the buffers in {{AggregationIterator}}. On master and 3.3, {{NewInstance}} with no arguments is considered foldable. As a result, the {{ConstantFolding}} rule turns NewInstance into a Literal holding an instance of the user's specified Java bean. The instance becomes a singleton that gets reused for each input record (although its fields get updated by {{InitializeJavaBean}}). Because the instance gets reused, multiple buffers in {{AggregationIterator}} can end up referring to the same Java bean instance. Take, for example, the test I added [here|https://github.com/bersprockets/spark/blob/17a8ad64f5bc39cb26d25b63f3692e7b8632baf8/sql/core/src/test/java/test/org/apache/spark/sql/JavaBeanDeserializationSuite.java#L560]. The input is: {noformat} List<Item> items = Arrays.asList( new Item("a", 1), new Item("b", 3), new Item("c", 2), new Item("a", 7)); {noformat} As {{ObjectAggregationIterator}} reads the input, the buffers get set up as follows (note that the first field of Item should be the same as the key): {noformat} - Read Item("a", 1) - Buffers are now: Key "a" --> Item("a", 1) - Read Item("b", 3) - Buffers are now: Key "a" -> Item("b", 3) Key "b" -> Item("b", 3) {noformat} The buffer for key "a" now contains Item("b", 3). That's because both buffers contain a reference to the same Item instance, and that Item instance's fields were updated when {{Item("b", 3)}} was read. When {{AggregationIterator}} finally calls the test's reduce function, it will pass the same Item instance ({{Item("a", 7)}}) as both the buffer and the input record. At that point, the buffers for "a", "b", and "c" will all contain {{Item("a", 7)}}. I _think_ the fix for this is to make {{NewInstance}} non-foldable. My linked test passes with that change (and fails without it). I will run the unit tests and hopefully make a PR tomorrow, assuming the proposed fix doesn't break something else besides {{ConstantFoldingSuite}}. 
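A language-agnostic illustration of the aliasing described in this comment (plain Python, not Spark code): when every group's buffer stores a reference to one shared mutable object whose fields are overwritten for each incoming row, every previously created buffer silently ends up reflecting the last row read.
{code:python}
class Item:
    def __init__(self, key, value):
        self.key, self.value = key, value

shared = Item(None, None)   # stands in for the constant-folded bean singleton
buffers = {}

for key, value in [("a", 1), ("b", 3), ("c", 2), ("a", 7)]:
    # Field update on the shared instance, analogous to InitializeJavaBean.
    shared.key, shared.value = key, value
    # Each new group buffer stores a reference to the same shared instance.
    buffers.setdefault(key, shared)

print({k: (v.key, v.value) for k, v in buffers.items()})
# {'a': ('a', 7), 'b': ('a', 7), 'c': ('a', 7)} -- every buffer shows the last row
{code}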
> Incorrect result of dataset reduceGroups in Java > > > Key: SPARK-38823 > URL: https://issues.apache.org/jira/browse/SPARK-38823 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.4.0 >Reporter: IKozar >Priority: Major > > {code:java} > @Data > @NoArgsConstructor > @AllArgsConstructor > public static class Item implements Serializable { > private String x; > private String y; > private int z; > public Item addZ(int z) { > return new Item(x, y, this.z + z); > } > } {code} > {code:java} > List<Item> items = List.of( > new Item("X1", "Y1", 1), > new Item("X2", "Y1", 1), > new Item("X1", "Y1", 1), > new Item("X2", "Y1", 1), > new Item("X3", "Y1", 1), > new Item("X1", "Y1", 1), > new Item("X1", "Y2", 1), > new Item("X2", "Y1", 1)); > Dataset<Item> ds = spark.createDataFrame(items, > Item.class).as(Encoders.bean(Item.class)); > ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> > Tuple2.apply(item.getX(), item.getY()), > Encoders.tuple(Encoders.STRING(), Encoders.STRING())) > .reduceGroups((ReduceFunction<Item>) (item1, item2) -> > item1.addZ(item2.getZ())) > .show(10); > {code} > result is > {noformat} > ++--+ > | key|ReduceAggregator(poc.job.JavaSparkReduce$Item)| > ++--+ > |{X1, Y1}| {X2, Y1, 2}|-- expected 3 > |{X2, Y1}| {X2, Y1, 2}|-- expected 3 > |{X1, Y2}| {X2, Y1, 1}| > |{X3, Y1}| {X2, Y1, 1}| > ++--+{noformat} > note that the key doesn't match the value -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38823) Incorrect result of dataset reduceGroups in Java
[ https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-38823: -- Labels: correctness (was: ) > Incorrect result of dataset reduceGroups in Java > > > Key: SPARK-38823 > URL: https://issues.apache.org/jira/browse/SPARK-38823 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.4.0 >Reporter: IKozar >Priority: Major > Labels: correctness > > {code:java} > @Data > @NoArgsConstructor > @AllArgsConstructor > public static class Item implements Serializable { > private String x; > private String y; > private int z; > public Item addZ(int z) { > return new Item(x, y, this.z + z); > } > } {code} > {code:java} > List<Item> items = List.of( > new Item("X1", "Y1", 1), > new Item("X2", "Y1", 1), > new Item("X1", "Y1", 1), > new Item("X2", "Y1", 1), > new Item("X3", "Y1", 1), > new Item("X1", "Y1", 1), > new Item("X1", "Y2", 1), > new Item("X2", "Y1", 1)); > Dataset<Item> ds = spark.createDataFrame(items, > Item.class).as(Encoders.bean(Item.class)); > ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> > Tuple2.apply(item.getX(), item.getY()), > Encoders.tuple(Encoders.STRING(), Encoders.STRING())) > .reduceGroups((ReduceFunction<Item>) (item1, item2) -> > item1.addZ(item2.getZ())) > .show(10); > {code} > result is > {noformat} > ++--+ > | key|ReduceAggregator(poc.job.JavaSparkReduce$Item)| > ++--+ > |{X1, Y1}| {X2, Y1, 2}|-- expected 3 > |{X2, Y1}| {X2, Y1, 2}|-- expected 3 > |{X1, Y2}| {X2, Y1, 1}| > |{X3, Y1}| {X2, Y1, 1}| > ++--+{noformat} > note that the key doesn't match the value -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38823) Incorrect result of dataset reduceGroups in Java
[ https://issues.apache.org/jira/browse/SPARK-38823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-38823: -- Affects Version/s: 3.3.0 > Incorrect result of dataset reduceGroups in Java > > > Key: SPARK-38823 > URL: https://issues.apache.org/jira/browse/SPARK-38823 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.0, 3.4.0 >Reporter: IKozar >Priority: Major > Labels: correctness > > {code:java} > @Data > @NoArgsConstructor > @AllArgsConstructor > public static class Item implements Serializable { > private String x; > private String y; > private int z; > public Item addZ(int z) { > return new Item(x, y, this.z + z); > } > } {code} > {code:java} > List<Item> items = List.of( > new Item("X1", "Y1", 1), > new Item("X2", "Y1", 1), > new Item("X1", "Y1", 1), > new Item("X2", "Y1", 1), > new Item("X3", "Y1", 1), > new Item("X1", "Y1", 1), > new Item("X1", "Y2", 1), > new Item("X2", "Y1", 1)); > Dataset<Item> ds = spark.createDataFrame(items, > Item.class).as(Encoders.bean(Item.class)); > ds.groupByKey((MapFunction<Item, Tuple2<String, String>>) item -> > Tuple2.apply(item.getX(), item.getY()), > Encoders.tuple(Encoders.STRING(), Encoders.STRING())) > .reduceGroups((ReduceFunction<Item>) (item1, item2) -> > item1.addZ(item2.getZ())) > .show(10); > {code} > result is > {noformat} > ++--+ > | key|ReduceAggregator(poc.job.JavaSparkReduce$Item)| > ++--+ > |{X1, Y1}| {X2, Y1, 2}|-- expected 3 > |{X2, Y1}| {X2, Y1, 2}|-- expected 3 > |{X1, Y2}| {X2, Y1, 1}| > |{X3, Y1}| {X2, Y1, 1}| > ++--+{noformat} > note that the key doesn't match the value -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38883) Smaller PySpark install if not using streaming?
t oo created SPARK-38883: Summary: Smaller PySpark install if not using streaming? Key: SPARK-38883 URL: https://issues.apache.org/jira/browse/SPARK-38883 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.2.1 Reporter: t oo h3. Describe the feature I am trying to include PySpark in my Docker image, but the install is around 300MB. The largest jar is rocksdbjni-6.20.3.jar at 35MB. Is it safe to remove this jar if I have no need for Spark Streaming? Is there any advice on getting the install smaller? Perhaps a map of which jars are needed for batch vs. SQL vs. streaming? h3. Use Case A smaller Python package means I can pack more concurrent pods onto my EKS workers. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
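Not an answer to whether rocksdbjni can be dropped, but a quick way to see which bundled jars dominate the size of a pip-installed PySpark; this assumes the pip layout, where the jars directory lives inside the pyspark package, and it only reports sizes.
{code:python}
import os
import pyspark

# Locate the jars bundled with the installed pyspark package.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
sizes = sorted(
    ((os.path.getsize(os.path.join(jars_dir, name)), name) for name in os.listdir(jars_dir)),
    reverse=True,
)
# Print the ten largest jars in MB.
for size, name in sizes[:10]:
    print(f"{size / (1024 * 1024):6.1f} MB  {name}")
{code}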
[jira] [Assigned] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38721: Assignee: (was: Apache Spark) > Test the error class: CANNOT_PARSE_DECIMAL > -- > > Key: SPARK-38721 > URL: https://issues.apache.org/jira/browse/SPARK-38721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to > QueryExecutionErrorsSuite. The test should cover the exception thrown in > QueryExecutionErrors: > {code:scala} > def cannotParseDecimalError(): Throwable = { > new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL", > messageParameters = Array.empty) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38721: Assignee: Apache Spark > Test the error class: CANNOT_PARSE_DECIMAL > -- > > Key: SPARK-38721 > URL: https://issues.apache.org/jira/browse/SPARK-38721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to > QueryExecutionErrorsSuite. The test should cover the exception thrown in > QueryExecutionErrors: > {code:scala} > def cannotParseDecimalError(): Throwable = { > new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL", > messageParameters = Array.empty) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38721) Test the error class: CANNOT_PARSE_DECIMAL
[ https://issues.apache.org/jira/browse/SPARK-38721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521422#comment-17521422 ] Apache Spark commented on SPARK-38721: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36169 > Test the error class: CANNOT_PARSE_DECIMAL > -- > > Key: SPARK-38721 > URL: https://issues.apache.org/jira/browse/SPARK-38721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class *CANNOT_PARSE_DECIMAL* to > QueryExecutionErrorsSuite. The test should cover the exception thrown in > QueryExecutionErrors: > {code:scala} > def cannotParseDecimalError(): Throwable = { > new SparkIllegalStateException(errorClass = "CANNOT_PARSE_DECIMAL", > messageParameters = Array.empty) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38804) Add StreamingQueryManager.removeListener in PySpark
[ https://issues.apache.org/jira/browse/SPARK-38804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521423#comment-17521423 ] Apache Spark commented on SPARK-38804: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36170 > Add StreamingQueryManager.removeListener in PySpark > --- > > Key: SPARK-38804 > URL: https://issues.apache.org/jira/browse/SPARK-38804 > Project: Spark > Issue Type: Improvement > Components: PySpark, Structured Streaming >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > SPARK-38759 added StreamingQueryManager.addListener. We should add > removeListener as well. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
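For context, a sketch of how the pair of calls is expected to look from PySpark, mirroring the Scala StreamingQueryManager API; this assumes an active SparkSession named {{spark}}.
{code:python}
from pyspark.sql.streaming import StreamingQueryListener

class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        print("started:", event.id)

    def onQueryProgress(self, event):
        print("rows/sec:", event.progress.processedRowsPerSecond)

    def onQueryTerminated(self, event):
        print("terminated:", event.id)

listener = MyListener()
spark.streams.addListener(listener)      # available since SPARK-38759
# ... run streaming queries ...
spark.streams.removeListener(listener)   # the API added by this ticket
{code}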