[jira] [Assigned] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests
[ https://issues.apache.org/jira/browse/SPARK-46694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-46694: Assignee: Kent Yao > Drop the assumptions of 'hive version < 2.0' in Hive version related tests > -- > > Key: SPARK-46694 > URL: https://issues.apache.org/jira/browse/SPARK-46694 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests
[ https://issues.apache.org/jira/browse/SPARK-46694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-46694. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44700 [https://github.com/apache/spark/pull/44700] > Drop the assumptions of 'hive version < 2.0' in Hive version related tests > -- > > Key: SPARK-46694 > URL: https://issues.apache.org/jira/browse/SPARK-46694 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Updated] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.
[ https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46696: --- Labels: pull-request-available (was: ) > In ResourceProfileManager, function calls should occur after variable > declarations. > --- > > Key: SPARK-46696 > URL: https://issues.apache.org/jira/browse/SPARK-46696 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: liangyongyuan >Priority: Major > Labels: pull-request-available > > As the title suggests, in *ResourceProfileManager*, function calls should be > made after variable declarations. When determining *isSupport*, all variables > are uninitialized, with booleans defaulting to false and objects to null. > While the end result is correct, the evaluation process is abnormal.
[jira] [Created] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.
liangyongyuan created SPARK-46696: - Summary: In ResourceProfileManager, function calls should occur after variable declarations. Key: SPARK-46696 URL: https://issues.apache.org/jira/browse/SPARK-46696 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: liangyongyuan As the title suggests, in *ResourceProfileManager*, function calls should be made after variable declarations. When determining *isSupport*, all variables are uninitialized, with booleans defaulting to false and objects to null. While the end result is correct, the evaluation process is abnormal.
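The initialization-order pitfall described in this ticket can be sketched outside Spark. This is a hypothetical toy, not the actual `ResourceProfileManager` code: class-level defaults stand in for the JVM's zero/false/null field defaults that a call made during construction would observe.

```python
# Illustration only: in Scala, a method invoked while the constructor is
# still running sees default values (false/null) for fields declared but
# not yet initialized. Class-level defaults mimic that here.

class ResourceProfileManagerSketch:
    # stand-ins for JVM default field values
    dynamic_enabled = False
    master = None

    def __init__(self, master, dynamic_enabled):
        # BUG pattern: validation runs before the fields are assigned,
        # so it evaluates against the defaults above.
        self.support_seen_by_early_call = self.is_supported()
        self.master = master
        self.dynamic_enabled = dynamic_enabled
        # FIX pattern: the same check after all fields are assigned.
        self.support_seen_by_late_call = self.is_supported()

    def is_supported(self):
        return bool(self.dynamic_enabled and self.master
                    and self.master.startswith("yarn"))


m = ResourceProfileManagerSketch("yarn", dynamic_enabled=True)
print(m.support_seen_by_early_call)  # False: computed from defaults
print(m.support_seen_by_late_call)   # True: computed after initialization
```

As the ticket notes, the end result can still be correct when the check is re-evaluated later, but the early evaluation is against uninitialized state.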
[jira] [Updated] (SPARK-46695) Always setting hive.execution.engine to mr
[ https://issues.apache.org/jira/browse/SPARK-46695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46695: --- Labels: pull-request-available (was: ) > Always setting hive.execution.engine to mr > -- > > Key: SPARK-46695 > URL: https://issues.apache.org/jira/browse/SPARK-46695 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-46695) Always setting hive.execution.engine to mr
Cheng Pan created SPARK-46695: - Summary: Always setting hive.execution.engine to mr Key: SPARK-46695 URL: https://issues.apache.org/jira/browse/SPARK-46695 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Cheng Pan
[jira] [Updated] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests
[ https://issues.apache.org/jira/browse/SPARK-46694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46694: --- Labels: pull-request-available (was: ) > Drop the assumptions of 'hive version < 2.0' in Hive version related tests > -- > > Key: SPARK-46694 > URL: https://issues.apache.org/jira/browse/SPARK-46694 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests
Kent Yao created SPARK-46694: Summary: Drop the assumptions of 'hive version < 2.0' in Hive version related tests Key: SPARK-46694 URL: https://issues.apache.org/jira/browse/SPARK-46694 Project: Spark Issue Type: Test Components: Tests Affects Versions: 4.0.0 Reporter: Kent Yao
[jira] [Updated] (SPARK-46429) avoid duplicate Classes and Resources in classpath of SPARK_HOME/jars/*.jar
[ https://issues.apache.org/jira/browse/SPARK-46429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46429: - Affects Version/s: (was: 3.5.2) > avoid duplicate Classes and Resources in classpath of SPARK_HOME/jars/*.jar > --- > > Key: SPARK-46429 > URL: https://issues.apache.org/jira/browse/SPARK-46429 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0, 4.0.0, 3.5.1 >Reporter: Arnaud Nauwynck >Priority: Minor > > There are 3679 duplicate resources (classes and other files) in the classpath > of "${SPARK_HOME}/jars/*.jar", among the 90756 classes. > This does not impact Spark itself (even though it might), > but it is annoying for end-users who want to check that they do not redeploy > additional redundant classes already in the classpath of the Spark runtime + > hadoop + cloud-specific environment. > At compile time, it is possible to check for such duplicate classes using, for > example, the maven plugin com.github.eirslett:maven-versions-plugin, but at > runtime it is much more difficult, because you might discover only late which > provisioned environment you are running on (example: Azure HDInsight, etc.) > Here is a minimalist sample code to check for duplicate classes, printing > a summary report of duplicate jars: > [https://github.com/Arnaud-Nauwynck/test-snippets/tree/master/test-classgraph-duplicate] > Running it on the bare spark 3.5.0 distribution, we get these warnings: > We see that many guava classes are packaged twice, because the shaded > "hadoop-client-runtime-3.3.4.jar" (with 18626 resources) has 927 duplicate(s) > also in "hadoop-shaded-guava-1.1.1.jar" (with 2428 resources) > Another example: "javax.jdo-3.2.0-m3.jar" (with 252 resources) has 174 > duplicate(s) in "jdo-api-3.0.1.jar" (with 213 resources). 
It is quite clear > that "javax.jdo-3.2.0-m3.jar" already contains a source copy of all the > classes of "jdo-api" jar, instead of defining a maven dependency. (see for > example the pom: > https://github.com/datanucleus/javax.jdo/blob/master/pom.xml#L51 > , and some class copy : > https://github.com/datanucleus/javax.jdo/blob/master/src/main/java/javax/jdo/annotations/ForeignKey.java#L35 > ) > In summary, we can see duplicates for classes in "guava", "checkerframework", > "parquet", "jdo-api", "jta", "orc", etc. > {noformat} > scanned 90756 classes > found 3679 resource duplicate(s) > Found duplicate resources among 256 x META-INF/MANIFEST, 22 x > META-INF/INDEX.LIST, 25 x META-INF/jandex.idx, 604 x other META-INF/**, > 3 x NOTICE, 3 x LICENSE, > 30 x package-info.class, 20 x module-info.class, > 4284 x inner classes, 22 x UnusedStubClass, > 20 x manifest.vm, 21 x schema/validation-schema.json, 21 x > schema/kube-schema.json, > Jar C:\apps\spark\spark-3.5.0\jars\datanucleus-api-jdo-4.2.4.jar (with 151 > resources) has 1 duplicate in > C:\apps\spark\spark-3.5.0\jars\datanucleus-rdbms-4.1.19.jar (with 781 > resources) >for resources plugin.xml > Jar C:\apps\spark\spark-3.5.0\jars\hadoop-client-runtime-3.3.4.jar (with > 18626 resources) has 927 duplicate(s) in > C:\apps\spark\spark-3.5.0\jars\hadoop-shaded-guava-1.1.1.jar (with 2428 > resources) >for resources with common prefix 'org/apache/hadoop/thirdparty/': > com/google/common/reflect/Reflection.class, > com/google/errorprone/annotations/CompatibleWith.class, > com/google/common/reflect/AbstractInvocationHandler.class, > com/google/common/graph/Traverser.class, > com/google/common/base/FinalizableSoftReference.class, > com/google/common/collect/AbstractSortedSetMultimap.class, > com/google/common/cache/Cache.class, > com/google/common/graph/UndirectedNetworkConnections.class, > com/google/common/hash/LongAddable.class, > com/google/common/io/ByteSource.class, > 
com/google/common/collect/SparseImmutableTable.class, > com/google/common/primitives/ImmutableDoubleArray.class, > org/checkerframework/checker/nullness/qual/EnsuresNonNullIf.class, > com/google/common/io/FileBackedOutputStream.class, > com/google/common/collect/SortedMultisetBridge.class, > com/google/common/collect/ImmutableListMultimap.class, > org/checkerframework/checker/units/qual/Length.class, > org/checkerframework/framework/qual/MonotonicQualifier.class, > org/checkerframework/checker/units/qual/m2.class, > com/google/common/collect/ImmutableMultimap.class, > org/checkerframework/common/util/report/qual/ReportUnqualified.class, > com/google/common/collect/Range.class, > com/google/common/hash/LittleEndianByteArray.class, > com/google/common/collect/Serialization.class, > com/google/common/collect/BoundType.class,
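The duplicate check the reporter links can be sketched without classgraph: jars are zip archives, so duplicate resources fall out of intersecting entry names across jars. The jar names and entries below are made up for the demo; this is not the linked tool's code.

```python
# Minimal duplicate-resource scan: index every entry name by the jars
# that contain it, then keep the names owned by more than one jar.
import io
import zipfile
from collections import defaultdict

def make_jar(entries):
    """Build an in-memory zip ("jar") from {entry_name: bytes}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf

# two hypothetical jars that both ship a shaded guava class
jars = {
    "demo-a.jar": make_jar({"com/google/common/base/Joiner.class": b"x",
                            "a/Only.class": b"y"}),
    "demo-b.jar": make_jar({"com/google/common/base/Joiner.class": b"x",
                            "b/Only.class": b"z"}),
}

owners = defaultdict(list)
for jar_name, buf in jars.items():
    with zipfile.ZipFile(buf) as zf:
        for entry in zf.namelist():
            if not entry.endswith("/"):   # skip directory entries
                owners[entry].append(jar_name)

duplicates = {e: js for e, js in owners.items() if len(js) > 1}
print(duplicates)   # Joiner.class is owned by both demo jars
```

Pointing the same loop at `${SPARK_HOME}/jars/*.jar` on disk (via `pathlib.Path.glob`) would reproduce the kind of summary quoted above.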
[jira] [Updated] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly
[ https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-46684: - Fix Version/s: 3.5.1 > CoGroup.applyInPandas/Arrow should pass arguments properly > -- > > Key: SPARK-46684 > URL: https://issues.apache.org/jira/browse/SPARK-46684 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > >
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ...     return pd.DataFrame(
> ...         {
> ...             "left_rows": [len(left)],
> ...             "left_columns": [len(left.columns)],
> ...             "right_rows": [len(right)],
> ...             "right_columns": [len(right.columns)],
> ...         }
> ...     )
> ...
> >>> df = (
> ...     df1.groupby("id")
> ...     .cogroup(df2.groupby("id"))
> ...     .applyInPandas(
> ...         summarize,
> ...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
> ...     )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
[jira] [Assigned] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly
[ https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46684: Assignee: Takuya Ueshin > CoGroup.applyInPandas/Arrow should pass arguments properly > -- > > Key: SPARK-46684 > URL: https://issues.apache.org/jira/browse/SPARK-46684 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > >
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ...     return pd.DataFrame(
> ...         {
> ...             "left_rows": [len(left)],
> ...             "left_columns": [len(left.columns)],
> ...             "right_rows": [len(right)],
> ...             "right_columns": [len(right.columns)],
> ...         }
> ...     )
> ...
> >>> df = (
> ...     df1.groupby("id")
> ...     .cogroup(df2.groupby("id"))
> ...     .applyInPandas(
> ...         summarize,
> ...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
> ...     )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
[jira] [Resolved] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly
[ https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46684. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44695 [https://github.com/apache/spark/pull/44695] > CoGroup.applyInPandas/Arrow should pass arguments properly > -- > > Key: SPARK-46684 > URL: https://issues.apache.org/jira/browse/SPARK-46684 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ...     return pd.DataFrame(
> ...         {
> ...             "left_rows": [len(left)],
> ...             "left_columns": [len(left.columns)],
> ...             "right_rows": [len(right)],
> ...             "right_columns": [len(right.columns)],
> ...         }
> ...     )
> ...
> >>> df = (
> ...     df1.groupby("id")
> ...     .cogroup(df2.groupby("id"))
> ...     .applyInPandas(
> ...         summarize,
> ...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
> ...     )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
[jira] [Resolved] (SPARK-46588) Interrupt when executing ANALYSIS phase
[ https://issues.apache.org/jira/browse/SPARK-46588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-46588. -- Resolution: Information Provided Jira is not a suitable place for questions; please use the user mailing list instead. > Interrupt when executing ANALYSIS phase > --- > > Key: SPARK-46588 > URL: https://issues.apache.org/jira/browse/SPARK-46588 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.4.0 >Reporter: JacobZheng >Priority: Major > > I have a long-running Spark app on which I start many tasks. When I am > executing complex tasks, I may spend a lot of time in the ANALYSIS phase or > the OPTIMIZATION phase, or run into OOM exceptions. I cancel the task > when a timeout is detected by calling the cancelJobGroup method. However, the task > is not interrupted and the execution plan is still being generated. *Is > there a way to interrupt these phases?* >
[jira] [Updated] (SPARK-46693) Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset
[ https://issues.apache.org/jira/browse/SPARK-46693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46693: --- Labels: pull-request-available (was: ) > Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset > > > Key: SPARK-46693 > URL: https://issues.apache.org/jira/browse/SPARK-46693 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.5.0 >Reporter: Nick Young >Priority: Major > Labels: pull-request-available > > For queries containing both a LIMIT and an OFFSET in a subquery, physical > translation will drop the `LocalLimit` planned in the optimizer stage by > mistake; this manifests as larger than necessary shuffle sizes for > `GlobalLimitExec`. Fix to not drop this node.
[jira] [Created] (SPARK-46693) Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset
Nick Young created SPARK-46693: -- Summary: Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset Key: SPARK-46693 URL: https://issues.apache.org/jira/browse/SPARK-46693 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.5.0 Reporter: Nick Young For queries containing both a LIMIT and an OFFSET in a subquery, physical translation will drop the `LocalLimit` planned in the optimizer stage by mistake; this manifests as larger than necessary shuffle sizes for `GlobalLimitExec`. Fix to not drop this node.
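Why a dropped `LocalLimit` inflates the shuffle can be seen from the planning arithmetic: for `LIMIT n OFFSET m`, each partition only ever needs to contribute its first n + m rows; the global step then skips m and keeps n. This toy (not Spark source; it ignores that an unordered LIMIT is nondeterministic in real Spark) shows the two stages:

```python
# Toy of the LIMIT n OFFSET m planning arithmetic. The LocalLimit caps each
# partition's contribution at n + m rows before the "shuffle"; without it,
# every row of every partition would be gathered.
def local_limit(partition, n, m):
    return partition[: n + m]

def global_limit_offset(partitions, n, m):
    # gather the locally-limited partitions, then skip m and keep n
    gathered = [row for p in partitions for row in local_limit(p, n, m)]
    return gathered[m : m + n]

parts = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
print(global_limit_offset(parts, n=2, m=3))  # [4, 5]
```

With the local limit in place, at most (n + m) * num_partitions rows cross the shuffle; dropping it, as the ticket describes, makes `GlobalLimitExec` receive every row.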
[jira] [Commented] (SPARK-46588) Interrupt when executing ANALYSIS phase
[ https://issues.apache.org/jira/browse/SPARK-46588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805873#comment-17805873 ] Kent Yao commented on SPARK-46588: -- You can call sc.setInterruptOnCancel(true) to interrupt the running task. Since Spark 4.0, you can also set spark.sql.execution.interruptOnCancel=true via configuration. > Interrupt when executing ANALYSIS phase > --- > > Key: SPARK-46588 > URL: https://issues.apache.org/jira/browse/SPARK-46588 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 3.4.0 >Reporter: JacobZheng >Priority: Major > > I have a long-running Spark app on which I start many tasks. When I am > executing complex tasks, I may spend a lot of time in the ANALYSIS phase or > the OPTIMIZATION phase, or run into OOM exceptions. I cancel the task > when a timeout is detected by calling the cancelJobGroup method. However, the task > is not interrupted and the execution plan is still being generated. *Is > there a way to interrupt these phases?* >
[jira] [Updated] (SPARK-46612) Clickhouse's JDBC throws `java.lang.IllegalArgumentException: Unknown data type: string` when write array string with Apache Spark scala
[ https://issues.apache.org/jira/browse/SPARK-46612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46612: --- Labels: pull-request-available (was: ) > Clickhouse's JDBC throws `java.lang.IllegalArgumentException: Unknown data > type: string` when write array string with Apache Spark scala > > > Key: SPARK-46612 > URL: https://issues.apache.org/jira/browse/SPARK-46612 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Nguyen Phan Huy >Priority: Major > Labels: pull-request-available > > The issue is also reported on Clickhouse's github: > [https://github.com/ClickHouse/clickhouse-java/issues/1505] > h3. Bug description > When using Scala Spark to write an array of strings to Clickhouse, the driver > throws a {{java.lang.IllegalArgumentException: Unknown data type: string}} > exception. > The exception is thrown by: > [https://github.com/ClickHouse/clickhouse-java/blob/aa3870eadb1a2d3675fd5119714c85851800f076/clickhouse-data/src/main/java/com/clickhouse/data/ClickHouseDataType.java#L238] > This is caused by Spark's JDBC utils casting the type name to lower case > ({{String}} -> {{string}}): > [https://github.com/apache/spark/blob/6b931530d75cb4f00236f9c6283de8ef450963ad/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L639] > h3. Steps to reproduce > # Create a Clickhouse table with a String Array field > ([https://clickhouse.com/]). 
> # Write data to the table with Scala Spark, via Clickhouse's JDBC driver > ([https://github.com/ClickHouse/clickhouse-java])
> {code:java}
> // code extraction; you will need to set up a Scala Spark job with the clickhouse jdbc driver
> val clickHouseSchema = StructType(
>   Seq(
>     StructField("str_array", ArrayType(StringType))
>   )
> )
> val data = Seq(
>   Row(
>     Seq("a", "b")
>   )
> )
> val clickHouseDf = spark.createDataFrame(sc.parallelize(data), clickHouseSchema)
>
> val props = new Properties
> props.put("user", "default")
> clickHouseDf.write
>   .mode(SaveMode.Append)
>   .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
>   .jdbc("jdbc:clickhouse://localhost:8123/foo", table = "bar", props)
> {code}
> h2. Fix
> - [https://github.com/apache/spark/pull/44459]
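The failure mode in this ticket is easy to reproduce in miniature: Spark lower-cases the element type name before handing it to a driver whose type lookup is case-sensitive. This is a toy reproduction of the mechanism, not the actual driver code, and the type table below is an illustrative made-up subset:

```python
# Case-sensitive type lookup, like ClickHouseDataType.of(): "String" is
# known, but the lower-cased "string" produced upstream is not.
CLICKHOUSE_TYPES = {"String": str, "Int32": int}   # hypothetical subset

def resolve_type(name):
    if name not in CLICKHOUSE_TYPES:
        raise ValueError(f"Unknown data type: {name}")
    return CLICKHOUSE_TYPES[name]

resolve_type("String")              # resolves fine
try:
    resolve_type("String".lower())  # what the lower-casing produces
except ValueError as e:
    print(e)                        # Unknown data type: string
```

The linked PR addresses this on the Spark side so the driver receives the original-cased name.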
[jira] [Resolved] (SPARK-46650) Replace AtomicBoolean with volatile boolean
[ https://issues.apache.org/jira/browse/SPARK-46650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-46650. -- Resolution: Not A Problem > Replace AtomicBoolean with volatile boolean > --- > > Key: SPARK-46650 > URL: https://issues.apache.org/jira/browse/SPARK-46650 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-25895) No test to compare Zstd and Lz4 Compression Algorithm
[ https://issues.apache.org/jira/browse/SPARK-25895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-25895: --- Labels: pull-request-available (was: ) > No test to compare Zstd and Lz4 Compression Algorithm > - > > Key: SPARK-25895 > URL: https://issues.apache.org/jira/browse/SPARK-25895 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Udbhav Agrawal >Priority: Minor > > As per SPARK-19112, Zstd's compression ratio is better than that of the default > compression codec (lz4). This test compares the shuffle spill, shuffle read, > and shuffle write values when each compression codec is used, as there was no > UT to verify this.
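The shape of the comparison such a test makes (same payload, two codecs, compare compressed sizes) can be sketched without a Spark cluster. As a hedge: zstd/lz4 Python bindings may not be installed, so this sketch uses two stdlib codecs as stand-ins; the real UT would instead set `spark.io.compression.codec` to `lz4` and `zstd` and compare shuffle metrics.

```python
# Stand-in codec comparison: compress one payload with two codecs and
# compare sizes, the same assertion structure a shuffle-size UT would use.
import lzma
import zlib

payload = b"shuffle-record," * 10_000   # highly repetitive, compresses well

zlib_size = len(zlib.compress(payload))
lzma_size = len(lzma.compress(payload))

print(f"raw={len(payload)} zlib={zlib_size} lzma={lzma_size}")
```

Both codecs should beat the raw size by a wide margin on this payload; which codec wins, and by how much, is exactly what the requested test would pin down for lz4 vs zstd.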
[jira] [Updated] (SPARK-46692) Fix potential issues with environment variable transmission `PYTHON_TO_TEST`
[ https://issues.apache.org/jira/browse/SPARK-46692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-46692: Summary: Fix potential issues with environment variable transmission `PYTHON_TO_TEST` (was: Fix potential issues with environment variable transmission `PYTHON_TO_TEST` in `build_python`) > Fix potential issues with environment variable transmission `PYTHON_TO_TEST` > > > Key: SPARK-46692 > URL: https://issues.apache.org/jira/browse/SPARK-46692 > Project: Spark > Issue Type: Bug > Components: Build, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available >
[jira] [Updated] (SPARK-46692) Fix potential issues with environment variable transmission `PYTHON_TO_TEST` in `build_python`
[ https://issues.apache.org/jira/browse/SPARK-46692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46692: --- Labels: pull-request-available (was: ) > Fix potential issues with environment variable transmission `PYTHON_TO_TEST` > in `build_python` > -- > > Key: SPARK-46692 > URL: https://issues.apache.org/jira/browse/SPARK-46692 > Project: Spark > Issue Type: Bug > Components: Build, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available >
[jira] [Assigned] (SPARK-46383) Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`
[ https://issues.apache.org/jira/browse/SPARK-46383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46383: --- Assignee: Utkarsh Agarwal > Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()` > -- > > Key: SPARK-46383 > URL: https://issues.apache.org/jira/browse/SPARK-46383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > Labels: pull-request-available > Attachments: Screenshot 2023-11-06 at 3.56.26 PM.png, screenshot-1.png > > > `AccumulableInfo` is one of the top heap consumers in driver's heap dumps for > stages with many tasks. For a stage with a large number of tasks > ({_}O(100k){_}), we saw *30%* of the heap usage stemming from > `TaskInfo.accumulables()`. > !screenshot-1.png|width=641,height=98! > The `TaskSetManager` today keeps around the TaskInfo objects > ([ref1|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134], > [ref2|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192]) > and in turn the task metrics (`AccumulableInfo`) for every task attempt > until the stage is completed. This means that for stages with a large number > of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even > when the task has completed and its metrics have been aggregated. Given a > task has a large number of metrics, stages with many tasks end up with a > large heap usage in the form of task metrics. > Ideally, we should clear up a task's TaskInfo upon the task's completion, > thereby reducing the driver's heap usage. 
[jira] [Resolved] (SPARK-46383) Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`
[ https://issues.apache.org/jira/browse/SPARK-46383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46383. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44321 [https://github.com/apache/spark/pull/44321] > Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()` > -- > > Key: SPARK-46383 > URL: https://issues.apache.org/jira/browse/SPARK-46383 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Utkarsh Agarwal >Assignee: Utkarsh Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Attachments: Screenshot 2023-11-06 at 3.56.26 PM.png, screenshot-1.png > > > `AccumulableInfo` is one of the top heap consumers in driver's heap dumps for > stages with many tasks. For a stage with a large number of tasks > ({_}O(100k){_}), we saw *30%* of the heap usage stemming from > `TaskInfo.accumulables()`. > !screenshot-1.png|width=641,height=98! > The `TaskSetManager` today keeps around the TaskInfo objects > ([ref1|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134], > [ref2|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192]) > and in turn the task metrics (`AccumulableInfo`) for every task attempt > until the stage is completed. This means that for stages with a large number > of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even > when the task has completed and its metrics have been aggregated. Given a > task has a large number of metrics, stages with many tasks end up with a > large heap usage in the form of task metrics. > Ideally, we should clear up a task's TaskInfo upon the task's completion, > thereby reducing the driver's heap usage. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
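The fix described above amounts to releasing per-task accumulable records as soon as a task finishes and its metrics are folded into the stage-level aggregate. The following is a minimal Python sketch of that idea only; the class and method names (`StageTracker`, `on_task_end`, etc.) are hypothetical and do not correspond to Spark's actual `TaskSetManager` code:

```python
# Hypothetical sketch (not Spark's actual scheduler code): drop each task's
# per-task accumulable records once its metrics have been aggregated, so
# completed tasks no longer pin O(tasks * metrics) objects on the driver.

class TaskInfo:
    def __init__(self, task_id):
        self.task_id = task_id
        # Per-task metric records; retaining this list for every completed
        # task is what dominates driver heap usage in large stages.
        self.accumulables = []

class StageTracker:
    def __init__(self):
        self.task_infos = {}
        self.aggregated = {}  # metric name -> aggregated value

    def on_task_start(self, task_id):
        self.task_infos[task_id] = TaskInfo(task_id)

    def on_metric(self, task_id, name, value):
        self.task_infos[task_id].accumulables.append((name, value))

    def on_task_end(self, task_id):
        info = self.task_infos[task_id]
        for name, value in info.accumulables:
            self.aggregated[name] = self.aggregated.get(name, 0) + value
        # Key idea: clear the per-task records after aggregation instead of
        # keeping them alive until the whole stage completes.
        info.accumulables = []

tracker = StageTracker()
for tid in range(3):
    tracker.on_task_start(tid)
    tracker.on_metric(tid, "recordsRead", 100)
    tracker.on_task_end(tid)

print(tracker.aggregated["recordsRead"])  # 300
print(sum(len(t.accumulables) for t in tracker.task_infos.values()))  # 0
```

The aggregate survives (300 records read across three tasks) while the retained per-task state drops to zero, which is the heap reduction the ticket is after.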
[jira] [Resolved] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
[ https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46640. - Fix Version/s: 3.5.1, 4.0.0 Resolution: Fixed Issue resolved by pull request 44645 [https://github.com/apache/spark/pull/44645] > RemoveRedundantAliases does not account for SubqueryExpression when removing > aliases > > > Key: SPARK-46640 > URL: https://issues.apache.org/jira/browse/SPARK-46640 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 4.0.0 >Reporter: Nikhil Sheoran >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1, 4.0.0 > > > `RemoveRedundantAliases` does not take into account the outer > attributes of a `SubqueryExpression` when removing aliases, potentially removing them if it > thinks they are redundant. > This can cause scenarios where a subquery expression has conditions like `a#x > = a#x`, i.e. both the attribute names and the expression ID(s) are the same. > This can then lead to a conflicting expression ID(s) error. > In `RemoveRedundantAliases`, we have an excluded AttributeSet argument > denoting the references for which we should not remove aliases. For a query > with a subquery expression, adding the references of this subquery to the > excluded set prevents such a rewrite from happening.
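The shape of the fix, an alias-removal pass that consults an "excluded" set of references it must not touch, can be sketched with a toy model. This is illustrative only, not Catalyst's actual rule; the pair-based representation of projections is an assumption made for the example:

```python
# Toy model (not Catalyst's RemoveRedundantAliases): strip aliases that merely
# rename an attribute to itself, but keep any alias whose name is in an
# "excluded" set, standing in for attributes referenced by a SubqueryExpression.

def remove_redundant_aliases(projections, excluded):
    """projections: list of (alias_name, source_attr) pairs."""
    result = []
    for alias, attr in projections:
        if alias == attr and alias not in excluded:
            result.append(attr)           # redundant self-alias: keep bare attribute
        else:
            result.append((alias, attr))  # keep the alias node intact
    return result

# Without exclusion, the self-alias on "a" is stripped.
print(remove_redundant_aliases([("a", "a"), ("xa", "a")], excluded=set()))
# ['a', ('xa', 'a')]

# With "a" excluded (an outer subquery refers to it), the alias survives,
# avoiding the a#x = a#x / conflicting-expression-ID situation.
print(remove_redundant_aliases([("a", "a"), ("xa", "a")], excluded={"a"}))
# [('a', 'a'), ('xa', 'a')]
```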
[jira] [Updated] (SPARK-46692) Fix potential issues with environment variable transmission `PYTHON_TO_TEST` in `build_python`
[ https://issues.apache.org/jira/browse/SPARK-46692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-46692: Summary: Fix potential issues with environment variable transmission `PYTHON_TO_TEST` in `build_python` (was: Fix potential issues with environment variable transmission `$PYTHON_TO_TEST` in `build_python`) > Fix potential issues with environment variable transmission `PYTHON_TO_TEST` > in `build_python` > -- > > Key: SPARK-46692 > URL: https://issues.apache.org/jira/browse/SPARK-46692 > Project: Spark > Issue Type: Bug > Components: Build, PySpark >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor
[jira] [Assigned] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases
[ https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46640: --- Assignee: Nikhil Sheoran > RemoveRedundantAliases does not account for SubqueryExpression when removing > aliases > > > Key: SPARK-46640 > URL: https://issues.apache.org/jira/browse/SPARK-46640 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 4.0.0 >Reporter: Nikhil Sheoran >Assignee: Nikhil Sheoran >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0, 3.5.1 > > > `RemoveRedundantAliases` does not take into account the outer > attributes of a `SubqueryExpression` when removing aliases, potentially removing them if it > thinks they are redundant. > This can cause scenarios where a subquery expression has conditions like `a#x > = a#x`, i.e. both the attribute names and the expression ID(s) are the same. > This can then lead to a conflicting expression ID(s) error. > In `RemoveRedundantAliases`, we have an excluded AttributeSet argument > denoting the references for which we should not remove aliases. For a query > with a subquery expression, adding the references of this subquery to the > excluded set prevents such a rewrite from happening.
[jira] [Created] (SPARK-46692) Fix potential issues with environment variable transmission `$PYTHON_TO_TEST` in `build_python`
BingKun Pan created SPARK-46692: --- Summary: Fix potential issues with environment variable transmission `$PYTHON_TO_TEST` in `build_python` Key: SPARK-46692 URL: https://issues.apache.org/jira/browse/SPARK-46692 Project: Spark Issue Type: Bug Components: Build, PySpark Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Assigned] (SPARK-46670) Make DataSourceManager isolated and self clone-able
[ https://issues.apache.org/jira/browse/SPARK-46670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46670: Assignee: Hyukjin Kwon > Make DataSourceManager isolated and self clone-able > > > Key: SPARK-46670 > URL: https://issues.apache.org/jira/browse/SPARK-46670 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > > Make DataSourceManager isolated and self clone-able
[jira] [Resolved] (SPARK-46670) Make DataSourceManager isolated and self clone-able
[ https://issues.apache.org/jira/browse/SPARK-46670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46670. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44681 [https://github.com/apache/spark/pull/44681] > Make DataSourceManager isolated and self clone-able > > > Key: SPARK-46670 > URL: https://issues.apache.org/jira/browse/SPARK-46670 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Make DataSourceManager isolated and self clone-able
[jira] [Updated] (SPARK-46686) Basic support of SparkSession based Python UDF profiler
[ https://issues.apache.org/jira/browse/SPARK-46686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46686: --- Labels: pull-request-available (was: ) > Basic support of SparkSession based Python UDF profiler > --- > > Key: SPARK-46686 > URL: https://issues.apache.org/jira/browse/SPARK-46686 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-46691) Support profiling on WindowInPandasExec
Takuya Ueshin created SPARK-46691: - Summary: Support profiling on WindowInPandasExec Key: SPARK-46691 URL: https://issues.apache.org/jira/browse/SPARK-46691 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-46690) Support profiling on FlatMapCoGroupsInBatchExec
Takuya Ueshin created SPARK-46690: - Summary: Support profiling on FlatMapCoGroupsInBatchExec Key: SPARK-46690 URL: https://issues.apache.org/jira/browse/SPARK-46690 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-46689) Support profiling on FlatMapGroupsInBatchExec
Takuya Ueshin created SPARK-46689: - Summary: Support profiling on FlatMapGroupsInBatchExec Key: SPARK-46689 URL: https://issues.apache.org/jira/browse/SPARK-46689 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-46688) Support profiling on AggregateInPandasExec
Takuya Ueshin created SPARK-46688: - Summary: Support profiling on AggregateInPandasExec Key: SPARK-46688 URL: https://issues.apache.org/jira/browse/SPARK-46688 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-46687) Implement memory-profiler
Takuya Ueshin created SPARK-46687: - Summary: Implement memory-profiler Key: SPARK-46687 URL: https://issues.apache.org/jira/browse/SPARK-46687 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-46686) Basic support of SparkSession based Python UDF profiler
Takuya Ueshin created SPARK-46686: - Summary: Basic support of SparkSession based Python UDF profiler Key: SPARK-46686 URL: https://issues.apache.org/jira/browse/SPARK-46686 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin
[jira] [Created] (SPARK-46685) Introduce SparkSession based PySpark UDF profiler
Takuya Ueshin created SPARK-46685: - Summary: Introduce SparkSession based PySpark UDF profiler Key: SPARK-46685 URL: https://issues.apache.org/jira/browse/SPARK-46685 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Takuya Ueshin The existing UDF profilers are SparkContext based, which can't support Spark Connect. We should introduce SparkSession based profilers and support Spark Connect.
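A session-scoped profiler, as opposed to a SparkContext-global one, ties profiling state to a session object that wraps each UDF and accumulates stats per UDF name. A minimal pure-Python sketch of that design, using `cProfile`/`pstats` from the standard library; the `SessionProfiler` class and its methods are hypothetical illustrations, not the actual PySpark API:

```python
# Illustrative sketch only (not the real PySpark profiler API): a profiler
# whose state lives on a per-session object rather than in a global context,
# wrapping UDFs and accumulating cProfile stats keyed by UDF name.
import cProfile
import io
import pstats

class SessionProfiler:
    def __init__(self):
        self._stats = {}  # udf name -> pstats.Stats

    def profile_udf(self, name, func):
        def wrapper(*args, **kwargs):
            pr = cProfile.Profile()
            pr.enable()
            try:
                return func(*args, **kwargs)
            finally:
                pr.disable()
                if name in self._stats:
                    self._stats[name].add(pr)  # merge this call's stats
                else:
                    self._stats[name] = pstats.Stats(pr)
        return wrapper

    def show(self, name, limit=5):
        """Render the accumulated stats for one UDF as text."""
        buf = io.StringIO()
        self._stats[name].stream = buf
        self._stats[name].print_stats(limit)
        return buf.getvalue()

profiler = SessionProfiler()
plus_one = profiler.profile_udf("plus_one", lambda x: x + 1)

print([plus_one(i) for i in range(3)])  # [1, 2, 3]
print("plus_one" in profiler._stats)    # True
```

Because the stats dictionary hangs off the `SessionProfiler` instance, two sessions can profile the same UDF name independently, which is the property a SparkContext-global registry cannot provide.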
[jira] [Assigned] (SPARK-46667) XML: Throw error on multiple XML data source
[ https://issues.apache.org/jira/browse/SPARK-46667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46667: Assignee: Sandip Agarwala > XML: Throw error on multiple XML data source > > > Key: SPARK-46667 > URL: https://issues.apache.org/jira/browse/SPARK-46667 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Assignee: Sandip Agarwala >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (SPARK-46667) XML: Throw error on multiple XML data source
[ https://issues.apache.org/jira/browse/SPARK-46667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46667. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44685 [https://github.com/apache/spark/pull/44685] > XML: Throw error on multiple XML data source > > > Key: SPARK-46667 > URL: https://issues.apache.org/jira/browse/SPARK-46667 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Assignee: Sandip Agarwala >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Resolved] (SPARK-46682) Upgrade `curator` to 5.6.0
[ https://issues.apache.org/jira/browse/SPARK-46682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46682. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44694 [https://github.com/apache/spark/pull/44694] > Upgrade `curator` to 5.6.0 > -- > > Key: SPARK-46682 > URL: https://issues.apache.org/jira/browse/SPARK-46682 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Updated] (SPARK-46683) Write a subquery generator that generates subqueries of different variations to increase testing coverage in this area
[ https://issues.apache.org/jira/browse/SPARK-46683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46683: --- Labels: correctness pull-request-available testing (was: correctness testing) > Write a subquery generator that generates subqueries of different variations > to increase testing coverage in this area > -- > > Key: SPARK-46683 > URL: https://issues.apache.org/jira/browse/SPARK-46683 > Project: Spark > Issue Type: Test > Components: Optimizer, SQL >Affects Versions: 3.5.1 >Reporter: Andy Lam >Priority: Major > Labels: correctness, pull-request-available, testing > > There are a lot of subquery correctness issues, ranging from very old bugs to > new ones that are being introduced due to work being done on subquery > correlation optimization. This is especially in the areas of COUNT bugs and > null behaviors. > To increase test coverage and robustness in this area, we want to write a > subquery generator that writes variations of subqueries, producing SQL tests > that also run against Postgres (from my work in SPARK-46179).
[jira] [Updated] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly
[ https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46684: --- Labels: pull-request-available (was: ) > CoGroup.applyInPandas/Arrow should pass arguments properly > -- > > Key: SPARK-46684 > URL: https://issues.apache.org/jira/browse/SPARK-46684 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Takuya Ueshin >Priority: Major > Labels: pull-request-available > > In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments > properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ...     return pd.DataFrame(
> ...         {
> ...             "left_rows": [len(left)],
> ...             "left_columns": [len(left.columns)],
> ...             "right_rows": [len(right)],
> ...             "right_columns": [len(right.columns)],
> ...         }
> ...     )
> ...
> >>> df = (
> ...     df1.groupby("id")
> ...     .cogroup(df2.groupby("id"))
> ...     .applyInPandas(
> ...         summarize,
> ...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
> ...     )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
[jira] [Updated] (SPARK-46665) Remove assertPandasOnSparkEqual
[ https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-46665: Summary: Remove assertPandasOnSparkEqual (was: Remove Pandas dependency for pyspark.testing) > Remove assertPandasOnSparkEqual > --- > > Key: SPARK-46665 > URL: https://issues.apache.org/jira/browse/SPARK-46665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > We should not make pyspark.testing depend on Pandas.
[jira] [Updated] (SPARK-46665) Remove assertPandasOnSparkEqual
[ https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-46665: Description: Remove deprecated API (was: We should not make pyspark.testing depending on Pandas.) > Remove assertPandasOnSparkEqual > --- > > Key: SPARK-46665 > URL: https://issues.apache.org/jira/browse/SPARK-46665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Remove deprecated API
[jira] [Created] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly
Takuya Ueshin created SPARK-46684: - Summary: CoGroup.applyInPandas/Arrow should pass arguments properly Key: SPARK-46684 URL: https://issues.apache.org/jira/browse/SPARK-46684 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Takuya Ueshin In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments properly, so the arguments of the UDF can be broken:
{noformat}
>>> import pandas as pd
>>>
>>> df1 = spark.createDataFrame(
...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
... )
>>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
>>>
>>> def summarize(left, right):
...     return pd.DataFrame(
...         {
...             "left_rows": [len(left)],
...             "left_columns": [len(left.columns)],
...             "right_rows": [len(right)],
...             "right_columns": [len(right.columns)],
...         }
...     )
...
>>> df = (
...     df1.groupby("id")
...     .cogroup(df2.groupby("id"))
...     .applyInPandas(
...         summarize,
...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
...     )
... )
>>>
>>> df.show()
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           1|         2|            1|
|        2|           1|         1|            1|
+---------+------------+----------+-------------+
{noformat}
The result should be:
{noformat}
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           3|         2|            2|
|        2|           3|         1|            2|
+---------+------------+----------+-------------+
{noformat}
[jira] [Created] (SPARK-46683) Write a subquery generator that generates subqueries of different variations to increase testing coverage in this area
Andy Lam created SPARK-46683: Summary: Write a subquery generator that generates subqueries of different variations to increase testing coverage in this area Key: SPARK-46683 URL: https://issues.apache.org/jira/browse/SPARK-46683 Project: Spark Issue Type: Test Components: Optimizer, SQL Affects Versions: 3.5.1 Reporter: Andy Lam There are a lot of subquery correctness issues, ranging from very old bugs to new ones that are being introduced due to work being done on subquery correlation optimization. This is especially in the areas of COUNT bugs and null behaviors. To increase test coverage and robustness in this area, we want to write a subquery generator that writes variations of subqueries, producing SQL tests that also run against Postgres (from my work in SPARK-46179).
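The core of such a generator is enumerating the cross product of a few template dimensions (subquery operator, aggregate, correlated or not) into SQL strings. A minimal sketch of that idea; the dimensions chosen and the `outer_t`/`inner_t` table names are placeholders, not the actual design of the Spark test suite's generator:

```python
# Minimal sketch of a subquery-variation generator: combine a few template
# dimensions into SQL strings. Table/column names are hypothetical.
import itertools

OPERATORS = ["IN", "NOT IN", "EXISTS"]
AGGREGATES = ["", "COUNT(*)", "SUM(b)"]  # "" means no aggregate in the subquery
CORRELATED = [True, False]

def generate_subqueries():
    queries = []
    for op, agg, corr in itertools.product(OPERATORS, AGGREGATES, CORRELATED):
        select_list = agg if agg else "b"
        where = " WHERE inner_t.a = outer_t.a" if corr else ""
        inner = f"SELECT {select_list} FROM inner_t{where}"
        if op == "EXISTS":
            pred = f"EXISTS ({inner})"
        else:
            pred = f"outer_t.b {op} ({inner})"
        queries.append(f"SELECT * FROM outer_t WHERE {pred}")
    return queries

qs = generate_subqueries()
print(len(qs))  # 18 = 3 operators x 3 aggregates x 2 correlation modes
print(qs[0])
```

Running each generated query against both Spark and a reference engine such as Postgres, and diffing results, is what surfaces COUNT-bug and null-behavior discrepancies.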
[jira] [Created] (SPARK-46682) Upgrade `curator` to 5.6.0
Dongjoon Hyun created SPARK-46682: - Summary: Upgrade `curator` to 5.6.0 Key: SPARK-46682 URL: https://issues.apache.org/jira/browse/SPARK-46682 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-46368) Support `readyz` in REST Submission API
[ https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46368. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44692 [https://github.com/apache/spark/pull/44692] > Support `readyz` in REST Submission API > --- > > Key: SPARK-46368 > URL: https://issues.apache.org/jira/browse/SPARK-46368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need > to provide `/readyz` API. > As a workaround, we can use the following. > {code} > readinessProbe: > exec: > command: ["sh", "-c", "! (curl -s > http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"] > {code}
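The curl workaround above boils down to "fetch the submission-status endpoint and call the node ready unless the body reports STANDBY". A small Python sketch of the same probe logic; the endpoint URL mirrors the workaround and the function names are illustrative, not part of any Spark API:

```python
# Sketch of the readiness decision the curl/grep workaround makes: ready
# when the status response does not mention STANDBY, not ready when the
# endpoint is unreachable or reports STANDBY.
from urllib.error import URLError
from urllib.request import urlopen

def is_ready(body: str) -> bool:
    """Decide readiness from a status response body."""
    return "STANDBY" not in body

def probe(url="http://localhost:6066/v1/submissions/status/none"):
    try:
        with urlopen(url, timeout=2) as resp:
            return is_ready(resp.read().decode("utf-8", "replace"))
    except URLError:
        return False  # endpoint unreachable -> not ready

print(is_ready('{"action":"SubmissionStatusResponse","success":true}'))  # True
print(is_ready('"message":"Server is in STANDBY mode"'))                 # False
```

A dedicated `/readyz` endpoint makes this negation-and-grep dance unnecessary, which is the point of the change.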
[jira] [Updated] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-46671: - Description: While bringing my old PR, which uses a different approach to the ConstraintPropagation algorithm ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with current master, I noticed a test failure in my branch for SPARK-33152. The failing test is in InferFiltersFromConstraintSuite:
{code}
test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") {
  val x = testRelation.as("x")
  val y = testRelation.as("y")
  val z = testRelation.as("z")

  // Removes EqualNullSafe when constructing candidate constraints
  comparePlans(
    InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
      .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
    x.select($"x.a", $"x.a".as("xa"))
      .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze)

  // Once strategy's idempotence is not broken
  val originalQuery =
    x.join(y, condition = Some($"x.a" === $"y.a"))
      .select($"x.a", $"x.a".as("xa")).as("xy")
      .join(z, condition = Some($"xy.a" === $"z.a")).analyze
  val correctAnswer =
    x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a"))
      .select($"x.a", $"x.a".as("xa")).as("xy")
      .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze
  val optimizedQuery = InferFiltersFromConstraints(originalQuery)
  comparePlans(optimizedQuery, correctAnswer)
  comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
}
{code}
In the above test, I believe the below assertion is not proper: a redundant filter is getting created. Out of these two isNotNull constraints, only one should be created: $"xa".isNotNull && $"x.a".isNotNull. Because "xa" is an alias of x."a", only one IsNotNull constraint is needed.
{code}
// Removes EqualNullSafe when constructing candidate constraints
comparePlans(
  InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
    .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
  x.select($"x.a", $"x.a".as("xa"))
    .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze)
{code}
This is not a big issue, but it highlights the need to take a relook at the code of ConstraintPropagation and related code. I am filing this jira so that constraint code can be tightened/made more robust.

was: While bringing my old PR, which uses a different approach to the ConstraintPropagation algorithm ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with current master, I noticed a test failure in my branch for SPARK-33152. The failing test is in InferFiltersFromConstraintSuite: {code} test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") { val x = testRelation.as("x") val y = testRelation.as("y") val z = testRelation.as("z") // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) // Once strategy's idempotence is not broken val originalQuery = x.join(y, condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z, condition = Some($"xy.a" === $"z.a")).analyze val correctAnswer = x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a")) .select($"x.a", $"x.a".as("xa")).as("xy") .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze val optimizedQuery = InferFiltersFromConstraints(originalQuery) comparePlans(optimizedQuery, correctAnswer) comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) } {code} In the above test, I believe the below assertion is not proper. There is a redundant filter which is getting created. Out of these two isNotNull constraints, only one should be created. $"xa".isNotNull && $"x.a".isNotNull Because presence of (xa#0 = a#0) automatically implies that if one attribute is not null, the other also has to be not null. // Removes EqualNullSafe when constructing candidate constraints comparePlans( InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), x.select($"x.a", $"x.a".as("xa")) .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze) This is not a big issue, but it highlights the need to take a relook at the code of ConstraintPropagation and related code. I am filing this jira so that constraint code can be tightened/made more robust.
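The point being made, that attributes linked by an alias or an equality form one equivalence class and need only a single IsNotNull constraint per class, can be modeled with a small union-find over attribute names. This is a toy illustration, not Catalyst's constraint-propagation code:

```python
# Toy model of alias-aware IsNotNull inference: attributes related by an
# alias (or an equality predicate) share an equivalence class, so one
# IsNotNull constraint per class suffices, not one per attribute.

def infer_not_null(equalities, attributes):
    """equalities: pairs known equal (aliases or '=' predicates)."""
    parent = {a: a for a in attributes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in equalities:
        parent[find(a)] = find(b)

    # Emit a single IsNotNull constraint per equivalence class.
    roots = {find(a) for a in attributes}
    return sorted(f"IsNotNull({r})" for r in roots)

# 'xa' is an alias of 'a', so one constraint suffices instead of two.
print(infer_not_null([("xa", "a")], ["a", "xa"]))  # ['IsNotNull(a)']

# With no known relation, each attribute needs its own constraint.
print(infer_not_null([], ["a", "xa"]))  # ['IsNotNull(a)', 'IsNotNull(xa)']
```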
[jira] [Reopened] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif reopened SPARK-46671: -- After further analysis, I believe that what I said originally in the ticket is valid and that the code does create a redundant constraint. The reason is that "xa" is an alias of "a", so there should be an IsNotNull constraint on only one of the attributes, not both. > InferFiltersFromConstraint rule is creating a redundant filter > -- > > Key: SPARK-46671 > URL: https://issues.apache.org/jira/browse/SPARK-46671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: SQL, catalyst > > While bringing my old PR, which uses a different approach to the > ConstraintPropagation algorithm > ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync > with current master, I noticed a test failure in my branch for SPARK-33152: > The failing test is in > InferFiltersFromConstraintSuite: > {code} > test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: > Infer Filters") { > val x = testRelation.as("x") > val y = testRelation.as("y") > val z = testRelation.as("z") > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > // Once strategy's idempotence is not broken > val originalQuery = > x.join(y, condition = Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z, condition = Some($"xy.a" === $"z.a")).analyze > val correctAnswer = > x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = > Some($"x.a" === $"y.a")) > .select($"x.a", $"x.a".as("xa")).as("xy") > .join(z.where($"a".isNotNull), condition = Some($"xy.a" === > $"z.a")).analyze > val optimizedQuery = InferFiltersFromConstraints(originalQuery) > comparePlans(optimizedQuery, correctAnswer) > comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer) > } > {code} > In the above test, I believe the below assertion is not proper. > There is a redundant filter which is getting created. > Out of these two isNotNull constraints, only one should be created. > $"xa".isNotNull && $"x.a".isNotNull > Because presence of (xa#0 = a#0) automatically implies that if one > attribute is not null, the other also has to be not null. > // Removes EqualNullSafe when constructing candidate constraints > comparePlans( > InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa")) > .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze), > x.select($"x.a", $"x.a".as("xa")) > .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && > $"xa" === $"x.a").analyze) > This is not a big issue, but it highlights the need to take a relook at the > code of ConstraintPropagation and related code. > I am filing this jira so that constraint code can be tightened/made more > robust.
[jira] [Resolved] (SPARK-46655) Skip query context catching in DataFrame methods
[ https://issues.apache.org/jira/browse/SPARK-46655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-46655. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44501 [https://github.com/apache/spark/pull/44501] > Skip query context catching in DataFrame methods > > > Key: SPARK-46655 > URL: https://issues.apache.org/jira/browse/SPARK-46655 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > To improve user experience with Spark DataFrame/Dataset APIs, and provide > more precise context of errors, catching of Dataframe query context can be > skipped in Dataframe/Dataset methods.
[jira] [Updated] (SPARK-46368) Support `readyz` in REST Submission API
[ https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46368: -- Summary: Support `readyz` in REST Submission API (was: Support `readyz` API) > Support `readyz` in REST Submission API > --- > > Key: SPARK-46368 > URL: https://issues.apache.org/jira/browse/SPARK-46368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need > to provide `/readyz` API. > As a workaround, we can use the following. > {code} > readinessProbe: > exec: > command: ["sh", "-c", "! (curl -s > http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46368) Support `readyz` API
[ https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46368: --- Labels: pull-request-available (was: ) > Support `readyz` API > > > Key: SPARK-46368 > URL: https://issues.apache.org/jira/browse/SPARK-46368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need > to provide `/readyz` API. > As a workaround, we can use the following. > {code} > readinessProbe: > exec: > command: ["sh", "-c", "! (curl -s > http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46680) Upgrade Apache commons-pool2 to 2.12.0
[ https://issues.apache.org/jira/browse/SPARK-46680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46680. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44683 [https://github.com/apache/spark/pull/44683] > Upgrade Apache commons-pool2 to 2.12.0 > -- > > Key: SPARK-46680 > URL: https://issues.apache.org/jira/browse/SPARK-46680 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > https://github.com/apache/commons-pool/blob/rel/commons-pool-2.12.0/RELEASE-NOTES.txt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46681) Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary computations when `MAX_EXECUTOR_FAILURES` is configured
[ https://issues.apache.org/jira/browse/SPARK-46681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46681: --- Labels: pull-request-available (was: ) > Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary > computations when `MAX_EXECUTOR_FAILURES` is configured > - > > Key: SPARK-46681 > URL: https://issues.apache.org/jira/browse/SPARK-46681 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > > {code:java} > def maxNumExecutorFailures(sparkConf: SparkConf): Int = { > val effectiveNumExecutors = > if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) { > sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS) > } else if (Utils.isDynamicAllocationEnabled(sparkConf)) { > sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS) > } else { > sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0) > } > // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation > is enabled. We need > // avoid the integer overflow here. > val defaultMaxNumExecutorFailures = math.max(3, > if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else 2 * > effectiveNumExecutors) > > sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures) > } {code} > The result of defaultMaxNumExecutorFailures is calculated first, even if > {{MAX_EXECUTOR_FAILURES}} is configured now > > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46681) Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary computations when `MAX_EXECUTOR_FAILURES` is configured
Yang Jie created SPARK-46681: Summary: Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary computations when `MAX_EXECUTOR_FAILURES` is configured Key: SPARK-46681 URL: https://issues.apache.org/jira/browse/SPARK-46681 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie {code:java} def maxNumExecutorFailures(sparkConf: SparkConf): Int = { val effectiveNumExecutors = if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) { sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS) } else if (Utils.isDynamicAllocationEnabled(sparkConf)) { sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS) } else { sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0) } // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation is enabled. We need // avoid the integer overflow here. val defaultMaxNumExecutorFailures = math.max(3, if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else 2 * effectiveNumExecutors) sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures) } {code} defaultMaxNumExecutorFailures is always calculated first, even when {{MAX_EXECUTOR_FAILURES}} is explicitly configured -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
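The refactor the issue asks for can be sketched outside Spark: defer the default computation with a supplier so it only runs when no explicit value is configured. All names below are hypothetical stand-ins, not Spark API.

```java
import java.util.OptionalInt;

// Sketch of the proposed laziness (hypothetical stand-ins for SparkConf
// lookups; not Spark code). The default is computed only when the
// explicit setting is absent, via orElseGet's supplier.
public class ExecutorFailureTrackerSketch {
    // Stand-in for sparkConf.get(MAX_EXECUTOR_FAILURES): empty = not configured.
    public static OptionalInt configuredMaxFailures = OptionalInt.empty();

    // Placeholder for the dynamic-allocation branches in the original Scala.
    public static int effectiveNumExecutors() {
        return 4;
    }

    public static int maxNumExecutorFailures() {
        // Check the explicit setting first; the lambda (and therefore the
        // default computation) runs only when it is absent.
        return configuredMaxFailures.orElseGet(() -> {
            int n = effectiveNumExecutors();
            // Guard against integer overflow when n is near Integer.MAX_VALUE
            // (dynamic allocation defaults the max to Int.MaxValue).
            return Math.max(3, n > Integer.MAX_VALUE / 2 ? Integer.MAX_VALUE : 2 * n);
        });
    }
}
```

In the Scala original the same effect could be had by making `defaultMaxNumExecutorFailures` a by-name or lazy value, or by moving the whole computation into `getOrElse`'s by-name argument.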
[jira] [Updated] (SPARK-46680) Upgrade Apache commons-pool2 to 2.12.0
[ https://issues.apache.org/jira/browse/SPARK-46680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46680: --- Labels: pull-request-available (was: ) > Upgrade Apache commons-pool2 to 2.12.0 > -- > > Key: SPARK-46680 > URL: https://issues.apache.org/jira/browse/SPARK-46680 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > > https://github.com/apache/commons-pool/blob/rel/commons-pool-2.12.0/RELEASE-NOTES.txt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46680) Upgrade Apache commons-pool2 to 2.12.0
Yang Jie created SPARK-46680: Summary: Upgrade Apache commons-pool2 to 2.12.0 Key: SPARK-46680 URL: https://issues.apache.org/jira/browse/SPARK-46680 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: Yang Jie https://github.com/apache/commons-pool/blob/rel/commons-pool-2.12.0/RELEASE-NOTES.txt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46368) Support `/readyz` API
[ https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46368: - Assignee: Dongjoon Hyun > Support `/readyz` API > - > > Key: SPARK-46368 > URL: https://issues.apache.org/jira/browse/SPARK-46368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need > to provide `/readyz` API. > As a workaround, we can use the following. > {code} > readinessProbe: > exec: > command: ["sh", "-c", "! (curl -s > http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46368) Support `readyz` API
[ https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46368: -- Summary: Support `readyz` API (was: Support `/readyz` API) > Support `readyz` API > > > Key: SPARK-46368 > URL: https://issues.apache.org/jira/browse/SPARK-46368 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need > to provide `/readyz` API. > As a workaround, we can use the following. > {code} > readinessProbe: > exec: > command: ["sh", "-c", "! (curl -s > http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44638) Unable to read from JDBC data sources when using custom schema containing varchar
[ https://issues.apache.org/jira/browse/SPARK-44638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805505#comment-17805505 ] Kent Yao commented on SPARK-44638: -- Can you reproduce this issue on 3.5.0 or master branch? > Unable to read from JDBC data sources when using custom schema containing > varchar > - > > Key: SPARK-44638 > URL: https://issues.apache.org/jira/browse/SPARK-44638 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.4, 3.3.2, 3.4.1 >Reporter: Michael Said >Priority: Critical > > When querying the data from JDBC databases with custom schema containing > varchar I got this error : > {code:java} > [23/07/14 06:12:19 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) ( > executor 1): java.sql.SQLException: Unsupported type varchar(100) at > org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedJdbcTypeError(QueryExecutionErrors.scala:818) > 23/07/14 06:12:21 INFO TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2) on > , executor 0: java.sql.SQLException (Unsupported type varchar(100)){code} > Code example: > {code:java} > CUSTOM_SCHEMA="ID Integer, NAME VARCHAR(100)" > df = spark.read.format("jdbc") > .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db") > .option("driver", "oracle.jdbc.OracleDriver") > .option("dbtable", "table") > .option("customSchema", CUSTOM_SCHEMA) > .option("user", "user") > .option("password", "password") > .load() > df.show(){code} > I tried to set {{spark.sql.legacy.charVarcharAsString = true}} to restore the > behavior before Spark 3.1 but it doesn't help. > The issue occurs in version 3.1.0 and above. I believe that this issue is > caused by https://issues.apache.org/jira/browse/SPARK-33480 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46679) Encoders with multiple inheritance - Key not found: T
Andoni Teso created SPARK-46679: --- Summary: Encoders with multiple inheritance - Key not found: T Key: SPARK-46679 URL: https://issues.apache.org/jira/browse/SPARK-46679 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.2 Reporter: Andoni Teso Attachments: spark_test.zip Since version 3.4, I've been experiencing the following error when using encoders. {code:java} Exception in thread "main" java.util.NoSuchElementException: key not found: T at scala.collection.immutable.Map$Map1.apply(Map.scala:163) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121) at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at org.example.Main.main(Main.java:26) {code} I'm attaching the code I use to reproduce the error locally. The issue is in the JavaTypeInference class when it tries to find the encoder for a ParameterizedType with the value Team. When running JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T again, but this time as a Company object, and pt.getRawType as Team. This ends up generating a tuple of Team, Company in the typeVariables map, leading to errors when searching for TypeVariables. In my case, I've resolved this by doing the following: {code:java} case tv: TypeVariable[_] => encoderFor(typeVariables.head._2, seenTypeSet, typeVariables) case pt: ParameterizedType => encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code} I haven't submitted a pull request because it doesn't seem to be the most optimal solution, or it might break some parts of the code. Additional validations or conditions may need to be added. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
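The generic shape the reporter describes can be sketched with plain reflection. The class names below (`Base`, `Team`, `Company`, `Employer`) are hypothetical, chosen to echo the names in the report; the actual reproducer is in the attached spark_test.zip and may differ. The point is that when a type variable `T` flows through two generic levels, resolving a field's `ParameterizedType` yields the raw type `Team` paired with the argument `Company`, the same `(Team, Company)` pairing the reporter says ends up in the `typeVariables` map.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;

// Hypothetical bean shapes: T is declared on Base, re-exported by Team,
// and finally bound to Company at the use site.
class Base<T> {
    public T value;
}

class Team<T> extends Base<T> {}

class Company {}

class Employer {
    public Team<Company> team; // field whose generic type is Team<Company>
}

public class GenericShape {
    // Resolve the field's ParameterizedType the way an encoder-inference
    // pass would: raw type on one side, actual type argument on the other.
    public static String describe() {
        try {
            Field f = Employer.class.getField("team");
            ParameterizedType pt = (ParameterizedType) f.getGenericType();
            Class<?> raw = (Class<?>) pt.getRawType();
            Class<?> arg = (Class<?>) pt.getActualTypeArguments()[0];
            return raw.getSimpleName() + "<" + arg.getSimpleName() + ">";
        } catch (NoSuchFieldException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A robust inference pass has to map the *type variable* `T` (not the raw class) to `Company` when it recurses into `Base`, which is the distinction the reporter's workaround papers over.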
[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T
[ https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andoni Teso updated SPARK-46679: Attachment: spark_test.zip > Encoders with multiple inheritance - Key not found: T > - > > Key: SPARK-46679 > URL: https://issues.apache.org/jira/browse/SPARK-46679 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.5.0 >Reporter: Andoni Teso >Priority: Major > Attachments: spark_test.zip > > > Since version 3.4, I've been experiencing the following error when using > encoders. > {code:java} > Exception in thread "main" java.util.NoSuchElementException: key not found: T > at scala.collection.immutable.Map$Map1.apply(Map.scala:163) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at 
scala.collection.TraversableLike.map(TraversableLike.scala:286) > at scala.collection.TraversableLike.map$(TraversableLike.scala:279) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62) > at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179) > at org.apache.spark.sql.Encoders.bean(Encoders.scala) > at org.example.Main.main(Main.java:26) {code} > I'm attaching the code I use to reproduce the error locally. > The issue is in the JavaTypeInference class when it tries to find the encoder > for a ParameterizedType with the value Team. When running > JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T > again, but this time as a Company object, and pt.getRawType as Team. This > ends up generating a tuple of Team, Company in the typeVariables map, leading > to errors when searching for TypeVariables. > In my case, I've resolved this by doing the following: > {code:java} > case tv: TypeVariable[_] => > encoderFor(typeVariables.head._2, seenTypeSet, typeVariables) > case pt: ParameterizedType => > encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code} > I haven't submitted a pull request because it doesn't seem to be the most > optimal solution, or it might break some parts of the code. Additional > validations or conditions may need to be added. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T
[ https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andoni Teso updated SPARK-46679: Description: Since version 3.4, I've been experiencing the following error when using encoders. {code:java} Exception in thread "main" java.util.NoSuchElementException: key not found: T at scala.collection.immutable.Map$Map1.apply(Map.scala:163) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121) at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60) at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62) at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179) at org.apache.spark.sql.Encoders.bean(Encoders.scala) at org.example.Main.main(Main.java:26) {code} I'm attaching the code I use to reproduce the error locally. [^spark_test.zip] The issue is in the JavaTypeInference class when it tries to find the encoder for a ParameterizedType with the value Team. When running JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T again, but this time as a Company object, and pt.getRawType as Team. This ends up generating a tuple of Team, Company in the typeVariables map, leading to errors when searching for TypeVariables. In my case, I've resolved this by doing the following: {code:java} case tv: TypeVariable[_] => encoderFor(typeVariables.head._2, seenTypeSet, typeVariables) case pt: ParameterizedType => encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code} I haven't submitted a pull request because it doesn't seem to be the most optimal solution, or it might break some parts of the code. Additional validations or conditions may need to be added. was: Since version 3.4, I've been experiencing the following error when using encoders. 
{code:java} Exception in thread "main" java.util.NoSuchElementException: key not found: T at scala.collection.immutable.Map$Map1.apply(Map.scala:163) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121) at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138) at org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at
[jira] [Assigned] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
[ https://issues.apache.org/jira/browse/SPARK-46678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46678: - Assignee: Kent Yao > Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs > -- > > Key: SPARK-46678 > URL: https://issues.apache.org/jira/browse/SPARK-46678 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > > {code:java} > [info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds) > 16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to > find data source: avro when check data column names. > org.apache.spark.sql.AnalysisException: Failed to find data source: avro. > Avro is built-in but external data source module since Spark 2.4. Please > deploy the application as per the deployment section of Apache Avro Data > Source Guide. > at > org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630) > at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660) > at > org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028) > at > org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016) > at > org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004) > at scala.Option.foreach(Option.scala:437) > at > org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004) > 00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . 
Setting > it to value: ignored > 00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . 
Setting > it to value: ignored > 00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.603 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting >
[jira] [Resolved] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
[ https://issues.apache.org/jira/browse/SPARK-46678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46678. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44687 [https://github.com/apache/spark/pull/44687] > Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs > -- > > Key: SPARK-46678 > URL: https://issues.apache.org/jira/browse/SPARK-46678 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {code:java} > [info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds) > 16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to > find data source: avro when check data column names. > org.apache.spark.sql.AnalysisException: Failed to find data source: avro. > Avro is built-in but external data source module since Spark 2.4. Please > deploy the application as per the deployment section of Apache Avro Data > Source Guide. > at > org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630) > at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660) > at > org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028) > at > org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016) > at > org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004) > at scala.Option.foreach(Option.scala:437) > at > org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004) > 00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . 
Setting > it to value: ignored > 00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . 
Setting > it to value: ignored > 00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: > datanucleus.autoStartMechanismMode is set to unsupported value null . Setting > it to value: ignored > 00:20:31.603
[jira] [Updated] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
[ https://issues.apache.org/jira/browse/SPARK-46678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46678: --- Labels: pull-request-available (was: )
[jira] [Created] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
Kent Yao created SPARK-46678: Summary: Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs Key: SPARK-46678 URL: https://issues.apache.org/jira/browse/SPARK-46678 Project: Spark Issue Type: Test Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao {code:java} [info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds) 16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to find data source: avro when check data column names. org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of Apache Avro Data Source Guide. at org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630) at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660) at org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028) at org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016) at org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004) at scala.Option.foreach(Option.scala:437) at org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004) 00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . 
Setting it to value: ignored 00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . 
Setting it to value: ignored 00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored 00:20:31.603 WARN org.apache.hadoop.hive.metastore.ObjectStore: datanucleus.autoStartMechanismMode is set to unsupported value null . Setting it to value: ignored [info] - 3.1: read avro file containing decimal (135 milliseconds) 16:20:31.626 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to find data source: avro when check data column names. org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of Apache Avro Data Source Guide. at
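The remedy named in the title amounts to giving the DataNucleus property an explicit supported value so the ObjectStore warning never fires. A minimal sketch as a metastore configuration entry, assuming the property is read from hive-site.xml (Spark may also inject it programmatically through its Hive client configuration; the placement here is illustrative, the property name is DataNucleus's own):

```xml
<!-- Hedged sketch: pre-set the property to a supported value so the
     "set to unsupported value null" warning is never logged. -->
<property>
  <name>datanucleus.autoStartMechanismMode</name>
  <value>ignored</value>
</property>
```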
[jira] [Resolved] (SPARK-46672) Upgrade log4j2 to 2.22.1
[ https://issues.apache.org/jira/browse/SPARK-46672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46672. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44682 [https://github.com/apache/spark/pull/44682] > Upgrade log4j2 to 2.22.1 > > > Key: SPARK-46672 > URL: https://issues.apache.org/jira/browse/SPARK-46672 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46672) Upgrade log4j2 to 2.22.1
[ https://issues.apache.org/jira/browse/SPARK-46672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46672: - Assignee: Yang Jie
[jira] [Resolved] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport
[ https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46675. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44686 [https://github.com/apache/spark/pull/44686] > Remove unused inferTimestampNTZ in ParquetReadSupport > - > > Key: SPARK-46675 > URL: https://issues.apache.org/jira/browse/SPARK-46675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport
[ https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46675: - Assignee: Cheng Pan
[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option
[ https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46641: -- Assignee: (was: Apache Spark) > Add maxBytesPerTrigger threshold option > --- > > Key: SPARK-46641 > URL: https://issues.apache.org/jira/browse/SPARK-46641 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Maksim Konstantinov >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > >
[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option
[ https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46641: -- Assignee: Apache Spark
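The option being added is a size-based admission threshold for the file stream source, analogous to the existing maxFilesPerTrigger. A minimal Python sketch of the admission logic (names are illustrative, not Spark's internal API): take files in arrival order until the cumulative size would cross the byte threshold, always admitting at least one file so a single oversized file cannot stall the stream.

```python
def take_files_for_trigger(files, max_bytes):
    """Select files for one micro-batch under a byte threshold.

    `files` is a list of (path, size_in_bytes) tuples in arrival order.
    At least one file is always admitted, so a file larger than the
    threshold cannot block progress. Illustrative sketch only; not
    Spark's actual implementation.
    """
    batch, total = [], 0
    for path, size in files:
        # Stop once adding the next file would exceed the budget,
        # but never return an empty batch.
        if batch and total + size > max_bytes:
            break
        batch.append(path)
        total += size
    return batch

files = [("a.json", 40), ("b.json", 50), ("c.json", 30)]
print(take_files_for_trigger(files, 100))  # first two fit; the third would exceed
```

In user code the option would presumably be passed like the existing threshold, e.g. `.option("maxBytesPerTrigger", "1g")` on a streaming reader, mirroring `maxFilesPerTrigger`; the exact accepted value syntax is an assumption here.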
[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46676: -- Assignee: (was: Apache Spark) > dropDuplicatesWithinWatermark throws error on canonicalizing plan > - > > Key: SPARK-46676 > URL: https://issues.apache.org/jira/browse/SPARK-46676 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.5.0, 4.0.0 >Reporter: Jungtaek Lim >Priority: Major > Labels: pull-request-available > > Simply said, this test code fails: > {code:java} > test("SPARK-X: canonicalization of > StreamingDeduplicateWithinWatermarkExec should work") { > withTempDir { checkpoint => > val dedupeInputData = MemoryStream[(String, Int)] > val dedupe = dedupeInputData.toDS() > .withColumn("eventTime", timestamp_seconds($"_2")) > .withWatermark("eventTime", "10 second") > .dropDuplicatesWithinWatermark("_1") > .select($"_1", $"eventTime".cast("long").as[Long]) > testStream(dedupe, Append)( > StartStream(checkpointLocation = checkpoint.getCanonicalPath), > AddData(dedupeInputData, "a" -> 1), > CheckNewAnswer("a" -> 1), > Execute { q => > // This threw out error! 
> q.lastExecution.executedPlan.canonicalized > } > ) > } > } {code} > with below error: > {code:java} > [info] - SPARK-X: canonicalization of > StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, > 237 milliseconds) > [info] Assert on query failed: Execute: None.get > [info] scala.None$.get(Option.scala:627) > [info] scala.None$.get(Option.scala:626) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148) > [info] > org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087) > [info] > org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210) > [info] > org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208) > [info] > org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949) > [info] > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
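The stack trace points at a constructor dereferencing an Option that is None once the plan is canonicalized: canonicalization rebuilds the node without state metadata, so the eager `Option.get` throws. A pure-Python model of that failure mode and the usual guard, with entirely hypothetical names (this is not Spark's code):

```python
class StatefulOp:
    """Toy model of a stateful plan node (hypothetical names)."""

    def __init__(self, state_info, child):
        self.state_info = state_info  # None on canonicalized copies
        self.child = child

    def key_expr_broken(self):
        # Buggy pattern: unconditional dereference, like Scala's Option.get.
        return self.state_info["keys"]

    def key_expr_fixed(self):
        # Guarded pattern: tolerate missing metadata on canonicalized plans.
        return self.state_info["keys"] if self.state_info is not None else []

    def canonicalized(self):
        # Canonicalization rebuilds the node without state metadata.
        return StatefulOp(None, self.child)

op = StatefulOp({"keys": ["_1"]}, child=None)
canon = op.canonicalized()
# canon.key_expr_broken() would raise TypeError (None is not subscriptable),
# mirroring the reported None.get failure.
print(canon.key_expr_fixed())  # prints []
```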
[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46676: -- Assignee: Apache Spark
[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46676: -- Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46676: -- Assignee: Apache Spark
[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option
[ https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46641: -- Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
[ https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46676: --- Labels: pull-request-available (was: )
[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option
[ https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46641: -- Assignee: Apache Spark
[jira] [Assigned] (SPARK-46665) Remove Pandas dependency for pyspark.testing
[ https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46665: -- Assignee: Apache Spark > Remove Pandas dependency for pyspark.testing > > > Key: SPARK-46665 > URL: https://issues.apache.org/jira/browse/SPARK-46665 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > We should not make pyspark.testing depend on Pandas.
[jira] [Assigned] (SPARK-46665) Remove Pandas dependency for pyspark.testing
[ https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46665: -- Assignee: (was: Apache Spark)
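A common way to drop a hard Pandas dependency from a testing helper is to defer the import to the code path that actually needs it. A hedged Python sketch of that lazy-import pattern (the function name and fallback behavior are illustrative, not the actual pyspark.testing change):

```python
def assert_frames_equal(left, right):
    """Compare two frame-like objects, importing pandas only if needed.

    Illustrative sketch of the lazy-import pattern; the real
    pyspark.testing helpers have different names and semantics.
    """
    try:
        # Deferred import: merely importing this module no longer
        # requires pandas to be installed.
        import pandas as pd
    except ImportError:
        assert left == right, f"{left!r} != {right!r}"
        return
    if isinstance(left, pd.DataFrame) and isinstance(right, pd.DataFrame):
        pd.testing.assert_frame_equal(left, right)
    else:
        assert left == right, f"{left!r} != {right!r}"
```

With this shape, environments without Pandas can still use the helper for plain-object comparisons, and the dependency is only exercised when DataFrame inputs arrive.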
[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option
[ https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46641: -- Assignee: (was: Apache Spark) > Add maxBytesPerTrigger threshold option > --- > > Key: SPARK-46641 > URL: https://issues.apache.org/jira/browse/SPARK-46641 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Maksim Konstantinov >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport
[ https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46675: -- Assignee: Apache Spark > Remove unused inferTimestampNTZ in ParquetReadSupport > - > > Key: SPARK-46675 > URL: https://issues.apache.org/jira/browse/SPARK-46675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46660) ReattachExecute requests do not refresh aliveness of SessionHolder
[ https://issues.apache.org/jira/browse/SPARK-46660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46660: -- Assignee: (was: Apache Spark) > ReattachExecute requests do not refresh aliveness of SessionHolder > -- > > Key: SPARK-46660 > URL: https://issues.apache.org/jira/browse/SPARK-46660 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 4.0.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > Labels: pull-request-available > > In the first executePlan request, creating the {{ExecuteHolder}} triggers > {{getOrCreateIsolatedSession}}, which refreshes the aliveness of the > {{SessionHolder}}. However, in {{ReattachExecute}}, we fetch the > {{ExecuteHolder}} directly without going through the {{SessionHolder}} (and > hence make it seem like the {{SessionHolder}} is idle). > > This would result in long-running queries (which do not send release execute > requests, since that refreshes aliveness) failing, because the > {{SessionHolder}} would expire during active query execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
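The fix direction the report implies is to route ReattachExecute lookups through the session holder, so that every request path refreshes the session's last-access time. A minimal sketch of that idea follows; the class and method names are illustrative, not Spark Connect's actual API:

```python
import time


class SessionHolder:
    """Tracks per-session executions and a last-access timestamp."""

    def __init__(self):
        self.last_access = time.monotonic()
        self.executions = {}

    def touch(self):
        # Refresh aliveness so an expiry sweeper does not reap a session
        # that still has active, long-running queries attached.
        self.last_access = time.monotonic()

    def get_execute_holder(self, operation_id):
        # Fetching an execution *through* the session refreshes aliveness,
        # unlike a direct global lookup that bypasses the session entirely.
        self.touch()
        return self.executions[operation_id]
```

The key design point is that aliveness bookkeeping lives on the lookup path itself, so no caller can forget to refresh it.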
[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport
[ https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46675: -- Assignee: (was: Apache Spark) > Remove unused inferTimestampNTZ in ParquetReadSupport > - > > Key: SPARK-46675 > URL: https://issues.apache.org/jira/browse/SPARK-46675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46660) ReattachExecute requests do not refresh aliveness of SessionHolder
[ https://issues.apache.org/jira/browse/SPARK-46660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-46660: -- Assignee: Apache Spark > ReattachExecute requests do not refresh aliveness of SessionHolder > -- > > Key: SPARK-46660 > URL: https://issues.apache.org/jira/browse/SPARK-46660 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 4.0.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > In the first executePlan request, creating the {{ExecuteHolder}} triggers > {{getOrCreateIsolatedSession}}, which refreshes the aliveness of the > {{SessionHolder}}. However, in {{ReattachExecute}}, we fetch the > {{ExecuteHolder}} directly without going through the {{SessionHolder}} (and > hence make it seem like the {{SessionHolder}} is idle). > > This would result in long-running queries (which do not send release execute > requests, since that refreshes aliveness) failing, because the > {{SessionHolder}} would expire during active query execution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan
Jungtaek Lim created SPARK-46676: Summary: dropDuplicatesWithinWatermark throws error on canonicalizing plan Key: SPARK-46676 URL: https://issues.apache.org/jira/browse/SPARK-46676 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.5.0, 4.0.0 Reporter: Jungtaek Lim

Simply put, this test code fails:

{code:java}
test("SPARK-X: canonicalization of StreamingDeduplicateWithinWatermarkExec should work") {
  withTempDir { checkpoint =>
    val dedupeInputData = MemoryStream[(String, Int)]
    val dedupe = dedupeInputData.toDS()
      .withColumn("eventTime", timestamp_seconds($"_2"))
      .withWatermark("eventTime", "10 second")
      .dropDuplicatesWithinWatermark("_1")
      .select($"_1", $"eventTime".cast("long").as[Long])

    testStream(dedupe, Append)(
      StartStream(checkpointLocation = checkpoint.getCanonicalPath),
      AddData(dedupeInputData, "a" -> 1),
      CheckNewAnswer("a" -> 1),
      Execute { q =>
        // This threw the error!
        q.lastExecution.executedPlan.canonicalized
      }
    )
  }
}
{code}

with the error below:

{code:java}
[info] - SPARK-X: canonicalization of StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 237 milliseconds)
[info]   Assert on query failed: Execute: None.get
[info]   scala.None$.get(Option.scala:627)
[info]   scala.None$.get(Option.scala:626)
[info]   org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.<init>(statefulOperators.scala:1101)
[info]   org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
[info]   org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
[info]   org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
[info]   org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
[info]   org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
[info]   org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
[info]   org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
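The stack trace shows the node's constructor failing in None.get while copy() runs during canonicalization: the constructor eagerly dereferences optional state that only exists on the live, executing plan. A language-neutral sketch of that failure mode, and the usual remedy of deferring the dereference, in Python (illustrative only, not the Spark code):

```python
class EagerExec:
    """Mimics a plan node whose constructor reads optional runtime state."""

    def __init__(self, event_time_col):
        # Eager dereference: blows up when the node is copied during
        # canonicalization, where the optional column is absent
        # (analogous to the None.get in the stack trace above).
        self.col_name = event_time_col["name"]


class LazyExec:
    """Defers the dereference until the value is actually needed."""

    def __init__(self, event_time_col):
        self._event_time_col = event_time_col

    @property
    def col_name(self):
        # Only resolved on the executing plan, never during copy().
        return self._event_time_col["name"]


# Canonicalization copies the node with runtime-only fields cleared:
try:
    EagerExec(None)  # raises TypeError, like None.get
except TypeError:
    pass
LazyExec(None)       # copy succeeds; the field is left untouched
```

The lazy variant makes construction total over canonicalized inputs, while the live plan still gets the same value on first access.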
[jira] [Resolved] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif resolved SPARK-46671. -- Resolution: Not A Bug
> InferFiltersFromConstraint rule is creating a redundant filter
> --
>
> Key: SPARK-46671
> URL: https://issues.apache.org/jira/browse/SPARK-46671
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Asif
> Priority: Minor
> Labels: SQL, catalyst
>
> While bringing my old PR, which uses a different approach to the ConstraintPropagation algorithm ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with current master, I noticed a test failure in my branch for SPARK-33152. The failing test is in InferFiltersFromConstraintSuite:
> {code}
> test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: Infer Filters") {
>   val x = testRelation.as("x")
>   val y = testRelation.as("y")
>   val z = testRelation.as("z")
>   // Removes EqualNullSafe when constructing candidate constraints
>   comparePlans(
>     InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
>       .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>     x.select($"x.a", $"x.a".as("xa"))
>       .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze)
>   // Once strategy's idempotence is not broken
>   val originalQuery =
>     x.join(y, condition = Some($"x.a" === $"y.a"))
>       .select($"x.a", $"x.a".as("xa")).as("xy")
>       .join(z, condition = Some($"xy.a" === $"z.a")).analyze
>   val correctAnswer =
>     x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = Some($"x.a" === $"y.a"))
>       .select($"x.a", $"x.a".as("xa")).as("xy")
>       .join(z.where($"a".isNotNull), condition = Some($"xy.a" === $"z.a")).analyze
>   val optimizedQuery = InferFiltersFromConstraints(originalQuery)
>   comparePlans(optimizedQuery, correctAnswer)
>   comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
> }
> {code}
> In the above test, I believe the assertion below is not proper: a redundant filter is being created. Of the two isNotNull constraints, only one should be created:
> $"xa".isNotNull && $"x.a".isNotNull
> The presence of (xa#0 = a#0) automatically implies that if one attribute is not null, the other also has to be not null.
> {code}
> // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
>     .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
>     .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" === $"x.a").analyze)
> {code}
> This is not a big issue, but it highlights the need to take another look at the ConstraintPropagation code and related code. I am filing this Jira so that the constraint code can be tightened and made more robust.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
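The inference the test exercises can be stated compactly: from each null-filtering equality predicate a = b, the rule derives IsNotNull on both operands, while skipping EqualNullSafe (<=>), which can be satisfied by two nulls. A toy sketch of that derivation (not Catalyst's implementation):

```python
def infer_not_null(predicates, existing=frozenset()):
    """From each equality predicate, infer IsNotNull on both operands.

    `predicates` is a list of (op, left, right) triples. Only "=" (a
    null-filtering comparison) contributes; "<=>" (EqualNullSafe) is
    skipped because it holds when both sides are null. Constraints
    already present in `existing` are not re-inferred.
    """
    inferred = set()
    for op, left, right in predicates:
        if op == "=":
            inferred.add(("IsNotNull", left))
            inferred.add(("IsNotNull", right))
    return inferred - set(existing)


# xa === x.a yields IsNotNull on both attributes -- exactly the two
# filters the test expects, so neither one is redundant.
constraints = infer_not_null([("=", "xa", "x.a"), ("<=>", "xa", "x.a")])
```

This matches the conclusion in the comments below the resolution: both IsNotNull filters are legitimate, because each must survive independently if the equality predicate is later pruned.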
[jira] [Commented] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805434#comment-17805434 ] Asif commented on SPARK-46671: -- On further thought, I am wrong: there should be two separate isNotNull constraints. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter
[ https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805435#comment-17805435 ] Asif commented on SPARK-46671: -- So, closing the ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport
[ https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46675: --- Labels: pull-request-available (was: ) > Remove unused inferTimestampNTZ in ParquetReadSupport > - > > Key: SPARK-46675 > URL: https://issues.apache.org/jira/browse/SPARK-46675 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport
Cheng Pan created SPARK-46675: - Summary: Remove unused inferTimestampNTZ in ParquetReadSupport Key: SPARK-46675 URL: https://issues.apache.org/jira/browse/SPARK-46675 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-46668) Parallelize Sphinx build of Python API docs
[ https://issues.apache.org/jira/browse/SPARK-46668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46668: Assignee: Nicholas Chammas > Parallelize Sphinx build of Python API docs > --- > > Key: SPARK-46668 > URL: https://issues.apache.org/jira/browse/SPARK-46668 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-46668) Parallelize Sphinx build of Python API docs
[ https://issues.apache.org/jira/browse/SPARK-46668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46668. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44680 [https://github.com/apache/spark/pull/44680] > Parallelize Sphinx build of Python API docs > --- > > Key: SPARK-46668 > URL: https://issues.apache.org/jira/browse/SPARK-46668 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org