[jira] [Assigned] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests

2024-01-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-46694:


Assignee: Kent Yao

> Drop the assumptions of 'hive version < 2.0' in Hive version related tests
> --
>
> Key: SPARK-46694
> URL: https://issues.apache.org/jira/browse/SPARK-46694
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests

2024-01-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46694.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44700
[https://github.com/apache/spark/pull/44700]

> Drop the assumptions of 'hive version < 2.0' in Hive version related tests
> --
>
> Key: SPARK-46694
> URL: https://issues.apache.org/jira/browse/SPARK-46694
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46696:
---
Labels: pull-request-available  (was: )

> In ResourceProfileManager, function calls should occur after variable 
> declarations.
> ---
>
> Key: SPARK-46696
> URL: https://issues.apache.org/jira/browse/SPARK-46696
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
>
> As the title suggests, in *ResourceProfileManager*, function calls should be
> made after variable declarations. When *isSupport* is evaluated during
> construction, all variables are still uninitialized, with booleans defaulting
> to false and objects to null. While the end result happens to be correct, the
> evaluation order is abnormal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46696) In ResourceProfileManager, function calls should occur after variable declarations.

2024-01-11 Thread liangyongyuan (Jira)
liangyongyuan created SPARK-46696:
-

 Summary: In ResourceProfileManager, function calls should occur 
after variable declarations.
 Key: SPARK-46696
 URL: https://issues.apache.org/jira/browse/SPARK-46696
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: liangyongyuan


As the title suggests, in *ResourceProfileManager*, function calls should be
made after variable declarations. When *isSupport* is evaluated during
construction, all variables are still uninitialized, with booleans defaulting
to false and objects to null. While the end result happens to be correct, the
evaluation order is abnormal.
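
As a rough illustration of the initialization-order pattern being described (a hypothetical sketch, not the actual ResourceProfileManager code), consider:

{code:scala}
// Hypothetical sketch only; names do not match the real ResourceProfileManager.
class Manager {
  // This call runs during construction, BEFORE `dynamicAllocationEnabled` below
  // is initialized, so the check observes the JVM default value `false`.
  val defaultProfileSupported: Boolean = isSupported()

  val dynamicAllocationEnabled: Boolean = true

  def isSupported(): Boolean = {
    // Reads a field that has not been initialized yet at call time.
    dynamicAllocationEnabled
  }
}

object Demo extends App {
  // Prints `false`, even though `dynamicAllocationEnabled` is declared as `true`:
  // the method was called before the variable declaration was evaluated.
  println(new Manager().defaultProfileSupported)
}
{code}

Declaring the fields before any method call that reads them (or making the check lazy) avoids evaluating against default values.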




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46695) Always setting hive.execution.engine to mr

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46695:
---
Labels: pull-request-available  (was: )

> Always setting hive.execution.engine to mr
> --
>
> Key: SPARK-46695
> URL: https://issues.apache.org/jira/browse/SPARK-46695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46695) Always setting hive.execution.engine to mr

2024-01-11 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-46695:
-

 Summary: Always setting hive.execution.engine to mr
 Key: SPARK-46695
 URL: https://issues.apache.org/jira/browse/SPARK-46695
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46694:
---
Labels: pull-request-available  (was: )

> Drop the assumptions of 'hive version < 2.0' in Hive version related tests
> --
>
> Key: SPARK-46694
> URL: https://issues.apache.org/jira/browse/SPARK-46694
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46694) Drop the assumptions of 'hive version < 2.0' in Hive version related tests

2024-01-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-46694:


 Summary: Drop the assumptions of 'hive version < 2.0' in Hive 
version related tests
 Key: SPARK-46694
 URL: https://issues.apache.org/jira/browse/SPARK-46694
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46429) avoid duplicate Classes and Resources in classpath of SPARK_HOME/jars/*.jar

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-46429:
-
Affects Version/s: (was: 3.5.2)

> avoid duplicate Classes and Resources in classpath of SPARK_HOME/jars/*.jar
> ---
>
> Key: SPARK-46429
> URL: https://issues.apache.org/jira/browse/SPARK-46429
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Arnaud Nauwynck
>Priority: Minor
>
> There are 3679 duplicate resources (classes and other files) among the 90756
> classes in the classpath of "${SPARK_HOME}/jars/*.jar".
> This has no direct impact on Spark itself (even though it might), but it is
> annoying for end users who want to check that they do not redeploy additional
> redundant classes already present in the classpath of the Spark runtime +
> Hadoop + cloud-specific environment.
> At compile time, it is possible to check for such duplicate classes, for
> example with the Maven plugin com.github.eirslett:maven-versions-plugin, but at
> runtime it is much more difficult because you might discover the provisioned
> environment you are running on late (example: Azure HDInsight, etc.).
> Here is a minimalist sample code to check for duplicate classes and print a
> summary report per duplicated jar:
> [https://github.com/Arnaud-Nauwynck/test-snippets/tree/master/test-classgraph-duplicate|https://github.com/Arnaud-Nauwynck/test-snippets/tree/master/test-classgraph-duplicate]
> Running it on the bare Spark 3.5.0 distribution, we get these warnings:
> We see that many guava classes are packaged twice, because the shaded
> "hadoop-client-runtime-3.3.4.jar" (with 18626 resources) has 927 duplicate(s) 
> also in "hadoop-shaded-guava-1.1.1.jar" (with 2428 resources)
> Another example: "javax.jdo-3.2.0-m3.jar" (with 252 resources) has 174 
> duplicate(s) in  "jdo-api-3.0.1.jar" (with 213 resources). It is quite clear 
> that "javax.jdo-3.2.0-m3.jar" already contains a source copy of all the 
> classes of "jdo-api" jar, instead of defining a maven dependency. (see for 
> example the pom: 
> https://github.com/datanucleus/javax.jdo/blob/master/pom.xml#L51
> , and some class copy :  
> https://github.com/datanucleus/javax.jdo/blob/master/src/main/java/javax/jdo/annotations/ForeignKey.java#L35
>  )
> In summary, we can see duplicates for classes in "guava", "checkerframework", 
> "parquet", "jdo-api", "jta", "orc", etc.
> {noformat}
> scanned  90756 classes
> found 3679 resource duplicate(s)
> Found duplicate resources among 256 x META-INF/MANIFEST, 22 x 
> META-INF/INDEX.LIST, 25 x META-INF/jandex.idx, 604 x other META-INF/**, 
>   3 x NOTICE, 3 x LICENSE, 
>   30 x package-info.class, 20 x module-info.class, 
>   4284 x inner classes, 22 x UnusedStubClass, 
>   20 x manifest.vm, 21 x schema/validation-schema.json, 21 x 
> schema/kube-schema.json, 
> Jar C:\apps\spark\spark-3.5.0\jars\datanucleus-api-jdo-4.2.4.jar (with 151 
> resources) has 1 duplicate in 
> C:\apps\spark\spark-3.5.0\jars\datanucleus-rdbms-4.1.19.jar (with 781 
> resources)
>for resources plugin.xml
> Jar C:\apps\spark\spark-3.5.0\jars\hadoop-client-runtime-3.3.4.jar (with 
> 18626 resources) has 927 duplicate(s) in 
> C:\apps\spark\spark-3.5.0\jars\hadoop-shaded-guava-1.1.1.jar (with 2428 
> resources)
>for resources with common prefix 'org/apache/hadoop/thirdparty/': 
> com/google/common/reflect/Reflection.class, 
> com/google/errorprone/annotations/CompatibleWith.class, 
> com/google/common/reflect/AbstractInvocationHandler.class, 
> com/google/common/graph/Traverser.class, 
> com/google/common/base/FinalizableSoftReference.class, 
> com/google/common/collect/AbstractSortedSetMultimap.class, 
> com/google/common/cache/Cache.class, 
> com/google/common/graph/UndirectedNetworkConnections.class, 
> com/google/common/hash/LongAddable.class, 
> com/google/common/io/ByteSource.class, 
> com/google/common/collect/SparseImmutableTable.class, 
> com/google/common/primitives/ImmutableDoubleArray.class, 
> org/checkerframework/checker/nullness/qual/EnsuresNonNullIf.class, 
> com/google/common/io/FileBackedOutputStream.class, 
> com/google/common/collect/SortedMultisetBridge.class, 
> com/google/common/collect/ImmutableListMultimap.class, 
> org/checkerframework/checker/units/qual/Length.class, 
> org/checkerframework/framework/qual/MonotonicQualifier.class, 
> org/checkerframework/checker/units/qual/m2.class, 
> com/google/common/collect/ImmutableMultimap.class, 
> org/checkerframework/common/util/report/qual/ReportUnqualified.class, 
> com/google/common/collect/Range.class, 
> com/google/common/hash/LittleEndianByteArray.class, 
> com/google/common/collect/Serialization.class, 
> com/google/common/collect/BoundType.class, 

[jira] [Updated] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-46684:
-
Fix Version/s: 3.5.1

> CoGroup.applyInPandas/Arrow should pass arguments properly
> --
>
> Key: SPARK-46684
> URL: https://issues.apache.org/jira/browse/SPARK-46684
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments 
> properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", 
> "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ... return pd.DataFrame(
> ... {
> ... "left_rows": [len(left)],
> ... "left_columns": [len(left.columns)],
> ... "right_rows": [len(right)],
> ... "right_columns": [len(right.columns)],
> ... }
> ... )
> ...
> >>> df = (
> ... df1.groupby("id")
> ... .cogroup(df2.groupby("id"))
> ... .applyInPandas(
> ... summarize,
> ... schema="left_rows long, left_columns long, right_rows long, 
> right_columns long",
> ... )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46684:


Assignee: Takuya Ueshin

> CoGroup.applyInPandas/Arrow should pass arguments properly
> --
>
> Key: SPARK-46684
> URL: https://issues.apache.org/jira/browse/SPARK-46684
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments 
> properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", 
> "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ... return pd.DataFrame(
> ... {
> ... "left_rows": [len(left)],
> ... "left_columns": [len(left.columns)],
> ... "right_rows": [len(right)],
> ... "right_columns": [len(right.columns)],
> ... }
> ... )
> ...
> >>> df = (
> ... df1.groupby("id")
> ... .cogroup(df2.groupby("id"))
> ... .applyInPandas(
> ... summarize,
> ... schema="left_rows long, left_columns long, right_rows long, 
> right_columns long",
> ... )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46684.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44695
[https://github.com/apache/spark/pull/44695]

> CoGroup.applyInPandas/Arrow should pass arguments properly
> --
>
> Key: SPARK-46684
> URL: https://issues.apache.org/jira/browse/SPARK-46684
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments 
> properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", 
> "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ... return pd.DataFrame(
> ... {
> ... "left_rows": [len(left)],
> ... "left_columns": [len(left.columns)],
> ... "right_rows": [len(right)],
> ... "right_columns": [len(right.columns)],
> ... }
> ... )
> ...
> >>> df = (
> ... df1.groupby("id")
> ... .cogroup(df2.groupby("id"))
> ... .applyInPandas(
> ... summarize,
> ... schema="left_rows long, left_columns long, right_rows long, 
> right_columns long",
> ... )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46588) Interrupt when executing ANALYSIS phase

2024-01-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46588.
--
Resolution: Information Provided

Jira is not a suitable place for questions; please use the user mailing lists
instead.

> Interrupt when executing ANALYSIS phase
> ---
>
> Key: SPARK-46588
> URL: https://issues.apache.org/jira/browse/SPARK-46588
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: JacobZheng
>Priority: Major
>
> I have a long-running Spark app on which I start many tasks. When executing
> complex tasks, I may spend a lot of time in the ANALYSIS or OPTIMIZATION phase,
> or run into OOM exceptions. I cancel the task when a timeout is detected by
> calling the cancelJobGroup method. However, the task is not interrupted and the
> execution plan is still being generated. *Is there a way to interrupt these
> phases?*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46693) Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46693:
---
Labels: pull-request-available  (was: )

> Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset
> 
>
> Key: SPARK-46693
> URL: https://issues.apache.org/jira/browse/SPARK-46693
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.5.0
>Reporter: Nick Young
>Priority: Major
>  Labels: pull-request-available
>
> For queries containing both a LIMIT and an OFFSET in a subquery, physical 
> translation will drop the `LocalLimit` planned in the optimizer stage by 
> mistake; this manifests as larger than necessary shuffle sizes for 
> `GlobalLimitExec`. Fix to not drop this node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46693) Inject LocalLimitExec when matching OffsetAndLimit or LimitAndOffset

2024-01-11 Thread Nick Young (Jira)
Nick Young created SPARK-46693:
--

 Summary: Inject LocalLimitExec when matching OffsetAndLimit or 
LimitAndOffset
 Key: SPARK-46693
 URL: https://issues.apache.org/jira/browse/SPARK-46693
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.5.0
Reporter: Nick Young


For queries containing both a LIMIT and an OFFSET in a subquery, physical 
translation will drop the `LocalLimit` planned in the optimizer stage by 
mistake; this manifests as larger than necessary shuffle sizes for 
`GlobalLimitExec`. Fix to not drop this node.
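
A hedged sketch of the query shape being described (table and column names, sizes, and the view are made up; assumes an active SparkSession `spark`):

{code:scala}
// Hypothetical repro shape only. Without a LocalLimit below the exchange, every
// partition's full output is shuffled to the single partition that evaluates
// GlobalLimitExec.
spark.range(0, 1000000).toDF("id").createOrReplaceTempView("t")

val df = spark.sql(
  """
    |SELECT * FROM (
    |  SELECT id FROM t ORDER BY id LIMIT 100 OFFSET 10
    |) sub
    |""".stripMargin)

df.explain()  // check whether a LocalLimit appears below the shuffle
{code}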



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46588) Interrupt when executing ANALYSIS phase

2024-01-11 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805873#comment-17805873
 ] 

Kent Yao commented on SPARK-46588:
--

You can call sc.setInterruptOnCancel(true) to interrupt the running task.

Since Spark 4.0, you can also set spark.sql.execution.interruptOnCancel=true via
the configuration.
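
For reference, a minimal Scala sketch of the job-group interrupt-on-cancel pattern this refers to (assumes an active SparkSession `spark`; the group name and the query are made up, and which driver-side phases are covered may differ by Spark version):

{code:scala}
import scala.concurrent.{ExecutionContext, Future}

// Job groups are thread-local, so set the group on the thread that runs the query.
val running = Future {
  spark.sparkContext.setJobGroup(
    "long-query", "complex analysis task", interruptOnCancel = true)
  spark.sql("SELECT count(*) FROM range(1000000000)").collect()
}(ExecutionContext.global)

// From a watchdog thread, once the timeout is detected:
spark.sparkContext.cancelJobGroup("long-query")  // also interrupts running task threads
{code}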

> Interrupt when executing ANALYSIS phase
> ---
>
> Key: SPARK-46588
> URL: https://issues.apache.org/jira/browse/SPARK-46588
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: JacobZheng
>Priority: Major
>
> I have a long-running Spark app on which I start many tasks. When executing
> complex tasks, I may spend a lot of time in the ANALYSIS or OPTIMIZATION phase,
> or run into OOM exceptions. I cancel the task when a timeout is detected by
> calling the cancelJobGroup method. However, the task is not interrupted and the
> execution plan is still being generated. *Is there a way to interrupt these
> phases?*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46612) Clickhouse's JDBC throws `java.lang.IllegalArgumentException: Unknown data type: string` when write array string with Apache Spark scala

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46612:
---
Labels: pull-request-available  (was: )

> Clickhouse's JDBC throws `java.lang.IllegalArgumentException: Unknown data 
> type: string` when write array string with Apache Spark scala
> 
>
> Key: SPARK-46612
> URL: https://issues.apache.org/jira/browse/SPARK-46612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Nguyen Phan Huy
>Priority: Major
>  Labels: pull-request-available
>
> Issue is also reported on Clickhouse's github: 
> [https://github.com/ClickHouse/clickhouse-java/issues/1505] 
> h3. Bug description
> When using Scala Spark to write an array of strings to ClickHouse, the driver
> throws a {{java.lang.IllegalArgumentException: Unknown data type: string}}
> exception.
> The exception is thrown by:
> [https://github.com/ClickHouse/clickhouse-java/blob/aa3870eadb1a2d3675fd5119714c85851800f076/clickhouse-data/src/main/java/com/clickhouse/data/ClickHouseDataType.java#L238]
> This is caused by Spark's JDBC utils converting the type name to lower case
> ({{String}} -> {{string}}):
> [https://github.com/apache/spark/blob/6b931530d75cb4f00236f9c6283de8ef450963ad/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L639]
> h3. Steps to reproduce
>  # Create a ClickHouse table with a String Array field
> ([https://clickhouse.com/]).
>  # Write data to the table with Scala Spark, via ClickHouse's JDBC driver
> ([https://github.com/ClickHouse/clickhouse-java])
> {code:java}
> // Code extraction; requires a Scala Spark job with the ClickHouse JDBC driver on the classpath.
> import java.util.Properties
> import org.apache.spark.sql.{Row, SaveMode}
> import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
>
> val clickHouseSchema = StructType(
>   Seq(
>     StructField("str_array", ArrayType(StringType))
>   )
> )
> val data = Seq(
>   Row(
>     Seq("a", "b")
>   )
> )
> val clickHouseDf = spark.createDataFrame(sc.parallelize(data), clickHouseSchema)
>
> val props = new Properties
> props.put("user", "default")
> clickHouseDf.write
>   .mode(SaveMode.Append)
>   .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
>   .jdbc("jdbc:clickhouse://localhost:8123/foo", table = "bar", props)
> {code}
> h2. Fix
>  - [https://github.com/apache/spark/pull/44459] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46650) Replace AtomicBoolean with volatile boolean

2024-01-11 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46650.
--
Resolution: Not A Problem

> Replace AtomicBoolean with volatile boolean
> ---
>
> Key: SPARK-46650
> URL: https://issues.apache.org/jira/browse/SPARK-46650
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25895) No test to compare Zstd and Lz4 Compression Algorithm

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-25895:
---
Labels: pull-request-available  (was: )

> No test to compare Zstd and Lz4 Compression Algorithm
> -
>
> Key: SPARK-25895
> URL: https://issues.apache.org/jira/browse/SPARK-25895
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Udbhav Agrawal
>Priority: Minor
>  Labels: pull-request-available
>
> As per SPARK-19112, Zstd's compression ratio is better than that of the default
> compression codec (lz4). This test compares the shuffle spill, shuffle read and
> shuffle write values when each compression codec is used; there was no UT to
> verify this.
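
As a hedged illustration of the comparison idea (not the UT added by the PR; the job and data sizes are made up), one can run the same shuffle under each codec and compare the shuffle write/read metrics:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// spark.io.compression.codec controls shuffle/spill block compression ("lz4" is the default).
def runShuffleWith(codec: String): Unit = {
  val spark = SparkSession.builder()
    .master("local[2]")
    .appName(s"shuffle-codec-$codec")
    .config("spark.io.compression.codec", codec)
    .getOrCreate()
  try {
    // A simple wide transformation to force shuffle writes and reads.
    spark.range(0, 1000000).groupBy(expr("id % 100")).count().collect()
  } finally {
    spark.stop()
  }
}

Seq("lz4", "zstd").foreach(runShuffleWith)
// Compare the resulting shuffle write/spill sizes via the UI or the task metrics.
{code}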



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46692) Fix potential issues with environment variable transmission `PYTHON_TO_TEST`

2024-01-11 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-46692:

Summary: Fix potential issues with environment variable transmission 
`PYTHON_TO_TEST`  (was: Fix potential issues with environment variable 
transmission `PYTHON_TO_TEST` in `build_python`)

> Fix potential issues with environment variable transmission `PYTHON_TO_TEST`
> 
>
> Key: SPARK-46692
> URL: https://issues.apache.org/jira/browse/SPARK-46692
> Project: Spark
>  Issue Type: Bug
>  Components: Build, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46692) Fix potential issues with environment variable transmission `PYTHON_TO_TEST` in `build_python`

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46692:
---
Labels: pull-request-available  (was: )

> Fix potential issues with environment variable transmission `PYTHON_TO_TEST` 
> in `build_python`
> --
>
> Key: SPARK-46692
> URL: https://issues.apache.org/jira/browse/SPARK-46692
> Project: Spark
>  Issue Type: Bug
>  Components: Build, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46383) Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46383:
---

Assignee: Utkarsh Agarwal

> Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`
> --
>
> Key: SPARK-46383
> URL: https://issues.apache.org/jira/browse/SPARK-46383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screenshot 2023-11-06 at 3.56.26 PM.png, screenshot-1.png
>
>
> `AccumulableInfo` is one of the top heap consumers in the driver's heap dumps
> for stages with many tasks. For a stage with a large number of tasks
> (_O(100k)_), we saw *30%* of the heap usage stemming from
> `TaskInfo.accumulables()`.
> !screenshot-1.png|width=641,height=98!  
> The `TaskSetManager` today keeps around the TaskInfo objects 
> ([ref1|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134],
>  
> [ref2|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192])
>  and in turn the task metrics (`AccumulableInfo`) for every task attempt 
> until the stage is completed. This means that for stages with a large number 
> of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even 
> when the task has completed and its metrics have been aggregated. Given a 
> task has a large number of metrics, stages with many tasks end up with a 
> large heap usage in the form of task metrics.
> Ideally, we should clear up a task's TaskInfo upon the task's completion, 
> thereby reducing the driver's heap usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46383) Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46383.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44321
[https://github.com/apache/spark/pull/44321]

> Reduce Driver Heap Usage by Reducing the Lifespan of `TaskInfo.accumulables()`
> --
>
> Key: SPARK-46383
> URL: https://issues.apache.org/jira/browse/SPARK-46383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Utkarsh Agarwal
>Assignee: Utkarsh Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: Screenshot 2023-11-06 at 3.56.26 PM.png, screenshot-1.png
>
>
> `AccumulableInfo` is one of the top heap consumers in the driver's heap dumps
> for stages with many tasks. For a stage with a large number of tasks
> (_O(100k)_), we saw *30%* of the heap usage stemming from
> `TaskInfo.accumulables()`.
> !screenshot-1.png|width=641,height=98!  
> The `TaskSetManager` today keeps around the TaskInfo objects 
> ([ref1|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L134],
>  
> [ref2|https://github.com/apache/spark/blob/c1ba963e64a22dea28e17b1ed954e6d03d38da1e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L192])
>  and in turn the task metrics (`AccumulableInfo`) for every task attempt 
> until the stage is completed. This means that for stages with a large number 
> of tasks, we keep metrics for all the tasks (`AccumulableInfo`) around even 
> when the task has completed and its metrics have been aggregated. Given a 
> task has a large number of metrics, stages with many tasks end up with a 
> large heap usage in the form of task metrics.
> Ideally, we should clear up a task's TaskInfo upon the task's completion, 
> thereby reducing the driver's heap usage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46640.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 44645
[https://github.com/apache/spark/pull/44645]

> RemoveRedundantAliases does not account for SubqueryExpression when removing 
> aliases
> 
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> `RemoveRedundantAliases` does not take into account the outer attributes of a
> `SubqueryExpression` when removing aliases, potentially removing aliases that
> it thinks are redundant.
> This can cause scenarios where a subquery expression has conditions like
> `a#x = a#x`, i.e. both the attribute names and the expression IDs are the same.
> This can then lead to a conflicting expression IDs error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument 
> denoting the references for which we should not remove aliases. For a query 
> with a subquery expression, adding the references of this subquery in the 
> excluded set prevents such rewrite from happening.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46692) Fix potential issues with environment variable transmission `PYTHON_TO_TEST` in `build_python`

2024-01-11 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-46692:

Summary: Fix potential issues with environment variable transmission 
`PYTHON_TO_TEST` in `build_python`  (was: Fix potential issues with environment 
variable transmission `$PYTHON_TO_TEST` in `build_python`)

> Fix potential issues with environment variable transmission `PYTHON_TO_TEST` 
> in `build_python`
> --
>
> Key: SPARK-46692
> URL: https://issues.apache.org/jira/browse/SPARK-46692
> Project: Spark
>  Issue Type: Bug
>  Components: Build, PySpark
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46640) RemoveRedundantAliases does not account for SubqueryExpression when removing aliases

2024-01-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46640:
---

Assignee: Nikhil Sheoran

> RemoveRedundantAliases does not account for SubqueryExpression when removing 
> aliases
> 
>
> Key: SPARK-46640
> URL: https://issues.apache.org/jira/browse/SPARK-46640
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 4.0.0
>Reporter: Nikhil Sheoran
>Assignee: Nikhil Sheoran
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> `RemoveRedundantAliases` does not take into account the outer attributes of a
> `SubqueryExpression` when removing aliases, potentially removing aliases that
> it thinks are redundant.
> This can cause scenarios where a subquery expression has conditions like
> `a#x = a#x`, i.e. both the attribute names and the expression IDs are the same.
> This can then lead to a conflicting expression IDs error.
> In `RemoveRedundantAliases`, we have an excluded AttributeSet argument 
> denoting the references for which we should not remove aliases. For a query 
> with a subquery expression, adding the references of this subquery in the 
> excluded set prevents such rewrite from happening.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46692) Fix potential issues with environment variable transmission `$PYTHON_TO_TEST` in `build_python`

2024-01-11 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-46692:
---

 Summary: Fix potential issues with environment variable 
transmission `$PYTHON_TO_TEST` in `build_python`
 Key: SPARK-46692
 URL: https://issues.apache.org/jira/browse/SPARK-46692
 Project: Spark
  Issue Type: Bug
  Components: Build, PySpark
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46670) Make DataSourceManager isolated and self clone-able

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46670:


Assignee: Hyukjin Kwon

> Make DataSourceManager isolated and self clone-able 
> 
>
> Key: SPARK-46670
> URL: https://issues.apache.org/jira/browse/SPARK-46670
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> Make DataSourceManager isolated and self clone-able 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46670) Make DataSourceManager isolated and self clone-able

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46670.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44681
[https://github.com/apache/spark/pull/44681]

> Make DataSourceManager isolated and self clone-able 
> 
>
> Key: SPARK-46670
> URL: https://issues.apache.org/jira/browse/SPARK-46670
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make DataSourceManager isolated and self clone-able 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46686) Basic support of SparkSession based Python UDF profiler

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46686:
---
Labels: pull-request-available  (was: )

> Basic support of SparkSession based Python UDF profiler
> ---
>
> Key: SPARK-46686
> URL: https://issues.apache.org/jira/browse/SPARK-46686
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46691) Support profiling on WindowInPandasExec

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46691:
-

 Summary: Support profiling on WindowInPandasExec
 Key: SPARK-46691
 URL: https://issues.apache.org/jira/browse/SPARK-46691
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46690) Support profiling on FlatMapCoGroupsInBatchExec

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46690:
-

 Summary: Support profiling on FlatMapCoGroupsInBatchExec
 Key: SPARK-46690
 URL: https://issues.apache.org/jira/browse/SPARK-46690
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46689) Support profiling on FlatMapGroupsInBatchExec

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46689:
-

 Summary: Support profiling on FlatMapGroupsInBatchExec
 Key: SPARK-46689
 URL: https://issues.apache.org/jira/browse/SPARK-46689
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46688) Support profiling on AggregateInPandasExec

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46688:
-

 Summary: Support profiling on AggregateInPandasExec
 Key: SPARK-46688
 URL: https://issues.apache.org/jira/browse/SPARK-46688
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46687) Implement memory-profiler

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46687:
-

 Summary: Implement memory-profiler
 Key: SPARK-46687
 URL: https://issues.apache.org/jira/browse/SPARK-46687
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46686) Basic support of SparkSession based Python UDF profiler

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46686:
-

 Summary: Basic support of SparkSession based Python UDF profiler
 Key: SPARK-46686
 URL: https://issues.apache.org/jira/browse/SPARK-46686
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46685) Introduce SparkSession based PySpark UDF profiler

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46685:
-

 Summary: Introduce SparkSession based PySpark UDF profiler
 Key: SPARK-46685
 URL: https://issues.apache.org/jira/browse/SPARK-46685
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Takuya Ueshin


The existing UDF profilers are SparkContext based, which can't support Spark 
Connect.

We should introduce SparkSession based profilers and support Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46667) XML: Throw error on multiple XML data source

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46667:


Assignee: Sandip Agarwala

> XML: Throw error on multiple XML data source
> 
>
> Key: SPARK-46667
> URL: https://issues.apache.org/jira/browse/SPARK-46667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46667) XML: Throw error on multiple XML data source

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46667.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44685
[https://github.com/apache/spark/pull/44685]

> XML: Throw error on multiple XML data source
> 
>
> Key: SPARK-46667
> URL: https://issues.apache.org/jira/browse/SPARK-46667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46682) Upgrade `curator` to 5.6.0

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46682.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44694
[https://github.com/apache/spark/pull/44694]

> Upgrade `curator` to 5.6.0
> --
>
> Key: SPARK-46682
> URL: https://issues.apache.org/jira/browse/SPARK-46682
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46683) Write a subquery generator that generates subqueries of different variations to increase testing coverage in this area

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46683:
---
Labels: correctness pull-request-available testing  (was: correctness 
testing)

> Write a subquery generator that generates subqueries of different variations 
> to increase testing coverage in this area
> --
>
> Key: SPARK-46683
> URL: https://issues.apache.org/jira/browse/SPARK-46683
> Project: Spark
>  Issue Type: Test
>  Components: Optimizer, SQL
>Affects Versions: 3.5.1
>Reporter: Andy Lam
>Priority: Major
>  Labels: correctness, pull-request-available, testing
>
> There are a lot of subquery correctness issues, ranging from very old bugs to
> new ones being introduced by ongoing work on subquery correlation optimization.
> This is especially true in the areas of COUNT bugs and null behaviors.
> To increase test coverage and robustness in this area, we want to write a
> subquery generator that produces variations of subqueries, emitting SQL tests
> that also run against Postgres (from my work in SPARK-46179).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46684:
---
Labels: pull-request-available  (was: )

> CoGroup.applyInPandas/Arrow should pass arguments properly
> --
>
> Key: SPARK-46684
> URL: https://issues.apache.org/jira/browse/SPARK-46684
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Takuya Ueshin
>Priority: Major
>  Labels: pull-request-available
>
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments 
> properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", 
> "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ... return pd.DataFrame(
> ... {
> ... "left_rows": [len(left)],
> ... "left_columns": [len(left.columns)],
> ... "right_rows": [len(right)],
> ... "right_columns": [len(right.columns)],
> ... }
> ... )
> ...
> >>> df = (
> ... df1.groupby("id")
> ... .cogroup(df2.groupby("id"))
> ... .applyInPandas(
> ... summarize,
> ... schema="left_rows long, left_columns long, right_rows long, 
> right_columns long",
> ... )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46665) Remove assertPandasOnSparkEqual

2024-01-11 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-46665:

Summary: Remove assertPandasOnSparkEqual  (was: Remove Pandas dependency 
for pyspark.testing)

> Remove assertPandasOnSparkEqual
> ---
>
> Key: SPARK-46665
> URL: https://issues.apache.org/jira/browse/SPARK-46665
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We should not make pyspark.testing depend on Pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46665) Remove assertPandasOnSparkEqual

2024-01-11 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-46665:

Description: Remove deprecated API  (was: We should not make 
pyspark.testing depending on Pandas.)

> Remove assertPandasOnSparkEqual
> ---
>
> Key: SPARK-46665
> URL: https://issues.apache.org/jira/browse/SPARK-46665
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Remove deprecated API



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46684) CoGroup.applyInPandas/Arrow should pass arguments properly

2024-01-11 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-46684:
-

 Summary: CoGroup.applyInPandas/Arrow should pass arguments properly
 Key: SPARK-46684
 URL: https://issues.apache.org/jira/browse/SPARK-46684
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Takuya Ueshin


In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments 
properly, so the arguments of the UDF can be broken:
{noformat}
>>> import pandas as pd
>>>
>>> df1 = spark.createDataFrame(
... [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", 
"v1", "v2")
... )
>>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
>>>
>>> def summarize(left, right):
... return pd.DataFrame(
... {
... "left_rows": [len(left)],
... "left_columns": [len(left.columns)],
... "right_rows": [len(right)],
... "right_columns": [len(right.columns)],
... }
... )
...
>>> df = (
... df1.groupby("id")
... .cogroup(df2.groupby("id"))
... .applyInPandas(
... summarize,
... schema="left_rows long, left_columns long, right_rows long, 
right_columns long",
... )
... )
>>>
>>> df.show()
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           1|         2|            1|
|        2|           1|         1|            1|
+---------+------------+----------+-------------+
{noformat}

The result should be:

{noformat}
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           3|         2|            2|
|        2|           3|         1|            2|
+---------+------------+----------+-------------+
{noformat}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46683) Write a subquery generator that generates subqueries of different variations to increase testing coverage in this area

2024-01-11 Thread Andy Lam (Jira)
Andy Lam created SPARK-46683:


 Summary: Write a subquery generator that generates subqueries of 
different variations to increase testing coverage in this area
 Key: SPARK-46683
 URL: https://issues.apache.org/jira/browse/SPARK-46683
 Project: Spark
  Issue Type: Test
  Components: Optimizer, SQL
Affects Versions: 3.5.1
Reporter: Andy Lam


There are a lot of subquery correctness issues, ranging from very old bugs to
new ones being introduced by ongoing work on subquery correlation optimization.
This is especially true in the areas of COUNT bugs and null behaviors.

To increase test coverage and robustness in this area, we want to write a
subquery generator that produces variations of subqueries, emitting SQL tests
that also run against Postgres (from my work in SPARK-46179).
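
As a purely illustrative sketch of what such a generator could enumerate (not the proposed implementation; the table names t1/t2 and the predicate shapes are assumptions), the idea is a cross product of outer-query templates and inner-query bodies:

{code:scala}
// Illustrative sketch only. Each generated string would be run against both
// Spark and Postgres and the results compared.
val innerBodies = Seq(
  "SELECT b FROM t2",                              // uncorrelated
  "SELECT b FROM t2 WHERE t2.c = t1.c",            // correlated
  "SELECT count(*) FROM t2 WHERE t2.c = t1.c"      // correlated aggregate (COUNT bug territory)
)

val templates: Seq[String => String] = Seq(
  inner => s"SELECT * FROM t1 WHERE t1.a IN ($inner)",
  inner => s"SELECT * FROM t1 WHERE EXISTS ($inner)",
  inner => s"SELECT t1.a, ($inner) AS v FROM t1"   // scalar subquery
)

// Cross product of templates x inner bodies = generated test queries.
val generatedQueries: Seq[String] =
  for (t <- templates; body <- innerBodies) yield t(body)

generatedQueries.foreach(println)
{code}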



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46682) Upgrade `curator` to 5.6.0

2024-01-11 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-46682:
-

 Summary: Upgrade `curator` to 5.6.0
 Key: SPARK-46682
 URL: https://issues.apache.org/jira/browse/SPARK-46682
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46368) Support `readyz` in REST Submission API

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46368.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44692
[https://github.com/apache/spark/pull/44692]

> Support `readyz` in REST Submission API
> ---
>
> Key: SPARK-46368
> URL: https://issues.apache.org/jira/browse/SPARK-46368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need 
> to provide `/readyz` API.
> As a workaround, we can use the following.
> {code}
> readinessProbe:
>   exec:
> command: ["sh", "-c", "! (curl -s 
> http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter

2024-01-11 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-46671:
-
Description: 
While bringing my old PR, which uses a different approach to the
ConstraintPropagation algorithm
([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with
the current master, I noticed a test failure in my branch for SPARK-33152.
The failing test is in InferFiltersFromConstraintSuite:
{code}
  test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: 
Infer Filters") {
val x = testRelation.as("x")
val y = testRelation.as("y")
val z = testRelation.as("z")

// Removes EqualNullSafe when constructing candidate constraints
comparePlans(
  InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
.where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
  x.select($"x.a", $"x.a".as("xa"))
.where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" 
=== $"x.a").analyze)

// Once strategy's idempotence is not broken
val originalQuery =
  x.join(y, condition = Some($"x.a" === $"y.a"))
.select($"x.a", $"x.a".as("xa")).as("xy")
.join(z, condition = Some($"xy.a" === $"z.a")).analyze

val correctAnswer =
  x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = 
Some($"x.a" === $"y.a"))
.select($"x.a", $"x.a".as("xa")).as("xy")
.join(z.where($"a".isNotNull), condition = Some($"xy.a" === 
$"z.a")).analyze

val optimizedQuery = InferFiltersFromConstraints(originalQuery)
comparePlans(optimizedQuery, correctAnswer)
comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
  }
{code}

In the above test, I believe the assertion below is not correct: a redundant
filter is being created. Of the two isNotNull constraints, only one should be
created:

$"xa".isNotNull && $"x.a".isNotNull

Because "xa" is an alias of x."a", only one IsNotNull constraint is needed.

  // Removes EqualNullSafe when constructing candidate constraints
comparePlans(
  InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
.where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
  x.select($"x.a", $"x.a".as("xa"))
.where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" 
=== $"x.a").analyze) 

This is not a big issue, but it highlights the need to take another look at
the ConstraintPropagation code and related code.

I am filing this JIRA so that the constraint code can be tightened and made
more robust.

  was:
While bringing my old PR, which uses a different approach to the
ConstraintPropagation algorithm
([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync with
the current master, I noticed a test failure in my branch for SPARK-33152.
The failing test is in InferFiltersFromConstraintSuite:
{code}
  test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: 
Infer Filters") {
val x = testRelation.as("x")
val y = testRelation.as("y")
val z = testRelation.as("z")

// Removes EqualNullSafe when constructing candidate constraints
comparePlans(
  InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
.where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
  x.select($"x.a", $"x.a".as("xa"))
.where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" 
=== $"x.a").analyze)

// Once strategy's idempotence is not broken
val originalQuery =
  x.join(y, condition = Some($"x.a" === $"y.a"))
.select($"x.a", $"x.a".as("xa")).as("xy")
.join(z, condition = Some($"xy.a" === $"z.a")).analyze

val correctAnswer =
  x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = 
Some($"x.a" === $"y.a"))
.select($"x.a", $"x.a".as("xa")).as("xy")
.join(z.where($"a".isNotNull), condition = Some($"xy.a" === 
$"z.a")).analyze

val optimizedQuery = InferFiltersFromConstraints(originalQuery)
comparePlans(optimizedQuery, correctAnswer)
comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
  }
{code}

In the above test, I believe the assertion below is not correct: a redundant
filter is being created. Of the two isNotNull constraints, only one should be
created:

$"xa".isNotNull && $"x.a".isNotNull

Because the presence of (xa#0 = a#0) automatically implies that if one
attribute is not null, the other also has to be not null.

  // Removes EqualNullSafe when constructing candidate constraints
comparePlans(
  InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
.where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
  x.select($"x.a", $"x.a".as("xa"))
.where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && $"xa" 
=== $"x.a").analyze) 

This is not a big issue, but it highlights the 

[jira] [Reopened] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter

2024-01-11 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif reopened SPARK-46671:
--

After further analysis, I believe that what I said originally in the ticket is
valid and that the code does create a redundant constraint.

The reason is that "xa" is an alias of "a", so there should be an IsNotNull
constraint on only one of the attributes, not both.
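One way to observe which IsNotNull filters the optimizer actually infers for
such an aliased column (a minimal sketch for spark-shell; the nullable test
column and names below are made up for illustration, not taken from the suite):

{code}
// Build a nullable int column, alias it, filter on the equality, and print the
// optimized plan; InferFiltersFromConstraints runs during optimization, so the
// inferred IsNotNull filters show up in the output.
import org.apache.spark.sql.functions.col

val df = spark.sql("SELECT * FROM VALUES (1), (CAST(NULL AS INT)) AS t(a)")
val q  = df.select(col("a"), col("a").as("xa")).where(col("xa") === col("a"))

println(q.queryExecution.optimizedPlan.numberedTreeString)
{code}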

> InferFiltersFromConstraint rule is creating a redundant filter
> --
>
> Key: SPARK-46671
> URL: https://issues.apache.org/jira/browse/SPARK-46671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Minor
>  Labels: SQL, catalyst
>
> While bringing my old PR, which uses a different approach to the
> ConstraintPropagation algorithm
> ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync
> with the current master, I noticed a test failure in my branch for
> SPARK-33152. The failing test is in InferFiltersFromConstraintSuite:
> {code}
>   test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: 
> Infer Filters") {
> val x = testRelation.as("x")
> val y = testRelation.as("y")
> val z = testRelation.as("z")
> // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze)
> // Once strategy's idempotence is not broken
> val originalQuery =
>   x.join(y, condition = Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z, condition = Some($"xy.a" === $"z.a")).analyze
> val correctAnswer =
>   x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = 
> Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z.where($"a".isNotNull), condition = Some($"xy.a" === 
> $"z.a")).analyze
> val optimizedQuery = InferFiltersFromConstraints(originalQuery)
> comparePlans(optimizedQuery, correctAnswer)
> comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
>   }
> {code}
> In the above test, I believe the assertion below is not correct: a redundant
> filter is being created. Of the two isNotNull constraints, only one should be
> created:
> $"xa".isNotNull && $"x.a".isNotNull
> Because the presence of (xa#0 = a#0) automatically implies that if one
> attribute is not null, the other also has to be not null.
>   // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze) 
> This is not a big issue, but it highlights the need to take another look at
> the ConstraintPropagation code and related code.
> I am filing this JIRA so that the constraint code can be tightened and made
> more robust.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46655) Skip query context catching in DataFrame methods

2024-01-11 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-46655.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44501
[https://github.com/apache/spark/pull/44501]

> Skip query context catching in DataFrame methods
> 
>
> Key: SPARK-46655
> URL: https://issues.apache.org/jira/browse/SPARK-46655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> To improve the user experience with Spark DataFrame/Dataset APIs and provide
> more precise error context, catching of the DataFrame query context can be
> skipped in DataFrame/Dataset methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46368) Support `readyz` in REST Submission API

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46368:
--
Summary: Support `readyz` in REST Submission API  (was: Support `readyz` 
API)

> Support `readyz` in REST Submission API
> ---
>
> Key: SPARK-46368
> URL: https://issues.apache.org/jira/browse/SPARK-46368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need 
> to provide `/readyz` API.
> As a workaround, we can use the following.
> {code}
> readinessProbe:
>   exec:
> command: ["sh", "-c", "! (curl -s 
> http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46368) Support `readyz` API

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46368:
---
Labels: pull-request-available  (was: )

> Support `readyz` API
> 
>
> Key: SPARK-46368
> URL: https://issues.apache.org/jira/browse/SPARK-46368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need 
> to provide `/readyz` API.
> As a workaround, we can use the following.
> {code}
> readinessProbe:
>   exec:
> command: ["sh", "-c", "! (curl -s 
> http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46680) Upgrade Apache commons-pool2 to 2.12.0

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46680.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44683
[https://github.com/apache/spark/pull/44683]

> Upgrade Apache commons-pool2 to 2.12.0
> --
>
> Key: SPARK-46680
> URL: https://issues.apache.org/jira/browse/SPARK-46680
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://github.com/apache/commons-pool/blob/rel/commons-pool-2.12.0/RELEASE-NOTES.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46681) Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary computations when `MAX_EXECUTOR_FAILURES` is configured

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46681:
---
Labels: pull-request-available  (was: )

> Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary 
> computations when `MAX_EXECUTOR_FAILURES` is configured
> -
>
> Key: SPARK-46681
> URL: https://issues.apache.org/jira/browse/SPARK-46681
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> def maxNumExecutorFailures(sparkConf: SparkConf): Int = {
>   val effectiveNumExecutors =
> if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
> } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
>   sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
> } else {
>   sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
> }
>   // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation 
> is enabled. We need
>   // avoid the integer overflow here.
>   val defaultMaxNumExecutorFailures = math.max(3,
> if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue else 2 * 
> effectiveNumExecutors)
>   
> sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
> } {code}
> The result of defaultMaxNumExecutorFailures is always calculated first, even
> when {{MAX_EXECUTOR_FAILURES}} is explicitly configured.
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46681) Refactor `ExecutorFailureTracker#maxNumExecutorFailures` to avoid unnecessary computations when `MAX_EXECUTOR_FAILURES` is configured

2024-01-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-46681:


 Summary: Refactor `ExecutorFailureTracker#maxNumExecutorFailures` 
to avoid unnecessary computations when `MAX_EXECUTOR_FAILURES` is configured
 Key: SPARK-46681
 URL: https://issues.apache.org/jira/browse/SPARK-46681
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie


{code:java}
def maxNumExecutorFailures(sparkConf: SparkConf): Int = {
  val effectiveNumExecutors =
    if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
      sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
    } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
      sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
    } else {
      sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
    }
  // By default, effectiveNumExecutors is Int.MaxValue if dynamic allocation is
  // enabled. We need avoid the integer overflow here.
  val defaultMaxNumExecutorFailures = math.max(3,
    if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue
    else 2 * effectiveNumExecutors)

  sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse(defaultMaxNumExecutorFailures)
} {code}
The result of defaultMaxNumExecutorFailures is always calculated first, even
when {{MAX_EXECUTOR_FAILURES}} is explicitly configured; a possible refactoring
is sketched below.
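A minimal sketch of one possible refactoring (an assumption about the
direction, not the merged change): move the default computation inside
{{getOrElse}}, whose argument is evaluated by name, so nothing is computed when
{{MAX_EXECUTOR_FAILURES}} is set.

{code:java}
def maxNumExecutorFailures(sparkConf: SparkConf): Int = {
  sparkConf.get(MAX_EXECUTOR_FAILURES).getOrElse {
    // Only reached when MAX_EXECUTOR_FAILURES is not configured.
    val effectiveNumExecutors =
      if (Utils.isStreamingDynamicAllocationEnabled(sparkConf)) {
        sparkConf.get(STREAMING_DYN_ALLOCATION_MAX_EXECUTORS)
      } else if (Utils.isDynamicAllocationEnabled(sparkConf)) {
        sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)
      } else {
        sparkConf.get(EXECUTOR_INSTANCES).getOrElse(0)
      }
    // effectiveNumExecutors defaults to Int.MaxValue when dynamic allocation
    // is enabled, so guard against integer overflow before doubling it.
    math.max(3,
      if (effectiveNumExecutors > Int.MaxValue / 2) Int.MaxValue
      else 2 * effectiveNumExecutors)
  }
}
{code}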
 
 
 
 
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46680) Upgrade Apache commons-pool2 to 2.12.0

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46680:
---
Labels: pull-request-available  (was: )

> Upgrade Apache commons-pool2 to 2.12.0
> --
>
> Key: SPARK-46680
> URL: https://issues.apache.org/jira/browse/SPARK-46680
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> https://github.com/apache/commons-pool/blob/rel/commons-pool-2.12.0/RELEASE-NOTES.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46680) Upgrade Apache commons-pool2 to 2.12.0

2024-01-11 Thread Yang Jie (Jira)
Yang Jie created SPARK-46680:


 Summary: Upgrade Apache commons-pool2 to 2.12.0
 Key: SPARK-46680
 URL: https://issues.apache.org/jira/browse/SPARK-46680
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: Yang Jie


https://github.com/apache/commons-pool/blob/rel/commons-pool-2.12.0/RELEASE-NOTES.txt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46368) Support `/readyz` API

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46368:
-

Assignee: Dongjoon Hyun

> Support `/readyz` API
> -
>
> Key: SPARK-46368
> URL: https://issues.apache.org/jira/browse/SPARK-46368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need 
> to provide `/readyz` API.
> As a workaround, we can use the following.
> {code}
> readinessProbe:
>   exec:
> command: ["sh", "-c", "! (curl -s 
> http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46368) Support `readyz` API

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46368:
--
Summary: Support `readyz` API  (was: Support `/readyz` API)

> Support `readyz` API
> 
>
> Key: SPARK-46368
> URL: https://issues.apache.org/jira/browse/SPARK-46368
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Like https://kubernetes.io/docs/reference/using-api/health-checks/, we need 
> to provide `/readyz` API.
> As a workaround, we can use the following.
> {code}
> readinessProbe:
>   exec:
> command: ["sh", "-c", "! (curl -s 
> http://localhost:6066/v1/submissions/status/none | grep -q STANDBY)"]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44638) Unable to read from JDBC data sources when using custom schema containing varchar

2024-01-11 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805505#comment-17805505
 ] 

Kent Yao commented on SPARK-44638:
--

Can you reproduce this issue on 3.5.0 or master branch?

> Unable to read from JDBC data sources when using custom schema containing 
> varchar
> -
>
> Key: SPARK-44638
> URL: https://issues.apache.org/jira/browse/SPARK-44638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.4, 3.3.2, 3.4.1
>Reporter: Michael Said
>Priority: Critical
>
> When querying data from JDBC databases with a custom schema containing
> varchar, I got this error:
> {code:java}
> 23/07/14 06:12:19 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) (executor 1): java.sql.SQLException: Unsupported type varchar(100)
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedJdbcTypeError(QueryExecutionErrors.scala:818)
> 23/07/14 06:12:21 INFO TaskSetManager: Lost task 0.1 in stage 1.0 (TID 2) on , executor 0: java.sql.SQLException (Unsupported type varchar(100))
> {code}
> Code example: 
> {code:java}
> CUSTOM_SCHEMA = "ID Integer, NAME VARCHAR(100)"
> df = (
>     spark.read.format("jdbc")
>     .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db")
>     .option("driver", "oracle.jdbc.OracleDriver")
>     .option("dbtable", "table")
>     .option("customSchema", CUSTOM_SCHEMA)
>     .option("user", "user")
>     .option("password", "password")
>     .load()
> )
> df.show()
> {code}
> I tried to set {{spark.sql.legacy.charVarcharAsString = true}} to restore the 
> behavior before Spark 3.1 but it doesn't help.
> The issue occurs in version 3.1.0 and above. I believe that this issue is 
> caused by https://issues.apache.org/jira/browse/SPARK-33480
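An untested workaround worth trying (an assumption, not a confirmed fix):
declare the column as STRING in the custom schema so that no varchar type
reaches the JDBC type mapping. A sketch, with the placeholder connection
details copied from the report:

{code}
// Hypothetical check, not verified against this report: the same read as
// above, but with STRING instead of VARCHAR(100) in customSchema.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@0.0.0.0:1521:db")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "table")
  .option("customSchema", "ID INTEGER, NAME STRING")
  .option("user", "user")
  .option("password", "password")
  .load()
df.show()
{code}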



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46679) Encoders with multiple inheritance - Key not found: T

2024-01-11 Thread Andoni Teso (Jira)
Andoni Teso created SPARK-46679:
---

 Summary: Encoders with multiple inheritance - Key not found: T
 Key: SPARK-46679
 URL: https://issues.apache.org/jira/browse/SPARK-46679
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.2
Reporter: Andoni Teso
 Attachments: spark_test.zip

Since version 3.4, I've been experiencing the following error when using 
encoders.
{code:java}
Exception in thread "main" java.util.NoSuchElementException: key not found: T
    at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
    at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
    at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
    at org.apache.spark.sql.Encoders.bean(Encoders.scala)
    at org.example.Main.main(Main.java:26) {code}
I'm attaching the code I use to reproduce the error locally. 

The issue is in the JavaTypeInference class when it tries to find the encoder 
for a ParameterizedType with the value Team. When running 
JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T again, 
but this time as a Company object, and pt.getRawType as Team. This ends up 
generating a tuple of Team, Company in the typeVariables map, leading to errors 
when searching for TypeVariables.

In my case, I've resolved this by doing the following:
{code:java}
case tv: TypeVariable[_] =>
  encoderFor(typeVariables.head._2, seenTypeSet, typeVariables)

case pt: ParameterizedType =>
  encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code}
I haven't submitted a pull request because this doesn't seem to be the optimal
solution, and it might break some parts of the code; additional validations or
conditions may need to be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T

2024-01-11 Thread Andoni Teso (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andoni Teso updated SPARK-46679:

Attachment: spark_test.zip

> Encoders with multiple inheritance - Key not found: T
> -
>
> Key: SPARK-46679
> URL: https://issues.apache.org/jira/browse/SPARK-46679
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Andoni Teso
>Priority: Major
> Attachments: spark_test.zip
>
>
> Since version 3.4, I've been experiencing the following error when using 
> encoders.
> {code:java}
> Exception in thread "main" java.util.NoSuchElementException: key not found: T
>     at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
>     at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>     at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>     at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>     at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
>     at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
>     at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
>     at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
>     at org.apache.spark.sql.Encoders.bean(Encoders.scala)
>     at org.example.Main.main(Main.java:26) {code}
> I'm attaching the code I use to reproduce the error locally. 
> The issue is in the JavaTypeInference class when it tries to find the encoder 
> for a ParameterizedType with the value Team. When running 
> JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T 
> again, but this time as a Company object, and pt.getRawType as Team. This 
> ends up generating a tuple of Team, Company in the typeVariables map, leading 
> to errors when searching for TypeVariables.
> In my case, I've resolved this by doing the following:
> {code:java}
> case tv: TypeVariable[_] =>
>   encoderFor(typeVariables.head._2, seenTypeSet, typeVariables)
> case pt: ParameterizedType =>
>   encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code}
> I haven't submitted a pull request because this doesn't seem to be the
> optimal solution, and it might break some parts of the code; additional
> validations or conditions may need to be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46679) Encoders with multiple inheritance - Key not found: T

2024-01-11 Thread Andoni Teso (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andoni Teso updated SPARK-46679:

Description: 
Since version 3.4, I've been experiencing the following error when using 
encoders.
{code:java}
Exception in thread "main" java.util.NoSuchElementException: key not found: T
    at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:60)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:53)
    at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:62)
    at org.apache.spark.sql.Encoders$.bean(Encoders.scala:179)
    at org.apache.spark.sql.Encoders.bean(Encoders.scala)
    at org.example.Main.main(Main.java:26) {code}
I'm attaching the code I use to reproduce the error locally.  [^spark_test.zip]

The issue is in the JavaTypeInference class when it tries to find the encoder 
for a ParameterizedType with the value Team. When running 
JavaTypeUtils.getTypeArguments(pt).asScala.toMap, it returns the type T again, 
but this time as a Company object, and pt.getRawType as Team. This ends up 
generating a tuple of Team, Company in the typeVariables map, leading to errors 
when searching for TypeVariables.

In my case, I've resolved this by doing the following:
{code:java}
case tv: TypeVariable[_] =>
  encoderFor(typeVariables.head._2, seenTypeSet, typeVariables)

case pt: ParameterizedType =>
  encoderFor(pt.getRawType, seenTypeSet, typeVariables) {code}
I haven't submitted a pull request because this doesn't seem to be the optimal
solution, and it might break some parts of the code; additional validations or
conditions may need to be added.

  was:
Since version 3.4, I've been experiencing the following error when using 
encoders.
{code:java}
Exception in thread "main" java.util.NoSuchElementException: key not found: T
    at scala.collection.immutable.Map$Map1.apply(Map.scala:163)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:121)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.encoderFor(JavaTypeInference.scala:138)
    at 
org.apache.spark.sql.catalyst.JavaTypeInference$.$anonfun$encoderFor$1(JavaTypeInference.scala:140)
    at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    at 

[jira] [Assigned] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46678:
-

Assignee: Kent Yao

> Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
> --
>
> Key: SPARK-46678
> URL: https://issues.apache.org/jira/browse/SPARK-46678
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds)
> 16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to 
> find data source: avro when check data column names.
> org.apache.spark.sql.AnalysisException: Failed to find data source: avro. 
> Avro is built-in but external data source module since Spark 2.4. Please 
> deploy the application as per the deployment section of Apache Avro Data 
> Source Guide.
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004)
>   at scala.Option.foreach(Option.scala:437)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004)
> 00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.603 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> 

[jira] [Resolved] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46678.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44687
[https://github.com/apache/spark/pull/44687]

> Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
> --
>
> Key: SPARK-46678
> URL: https://issues.apache.org/jira/browse/SPARK-46678
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {code:java}
> [info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds)
> 16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to 
> find data source: avro when check data column names.
> org.apache.spark.sql.AnalysisException: Failed to find data source: avro. 
> Avro is built-in but external data source module since Spark 2.4. Please 
> deploy the application as per the deployment section of Apache Avro Data 
> Source Guide.
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004)
>   at scala.Option.foreach(Option.scala:437)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004)
> 00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.603 

[jira] [Updated] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46678:
---
Labels: pull-request-available  (was: )

> Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs
> --
>
> Key: SPARK-46678
> URL: https://issues.apache.org/jira/browse/SPARK-46678
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> [info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds)
> 16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to 
> find data source: avro when check data column names.
> org.apache.spark.sql.AnalysisException: Failed to find data source: avro. 
> Avro is built-in but external data source module since Spark 2.4. Please 
> deploy the application as per the deployment section of Apache Avro Data 
> Source Guide.
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004)
>   at scala.Option.foreach(Option.scala:437)
>   at 
> org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004)
> 00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: ignored
> 00:20:31.603 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
> datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
> it to value: 

[jira] [Created] (SPARK-46678) Set datanucleus.autoStartMechanismMode=ignored to clean the wall of noisy logs

2024-01-11 Thread Kent Yao (Jira)
Kent Yao created SPARK-46678:


 Summary: Set datanucleus.autoStartMechanismMode=ignored to clean 
the wall of noisy logs
 Key: SPARK-46678
 URL: https://issues.apache.org/jira/browse/SPARK-46678
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao


{code:java}
[info] - 3.1: Decimal support of Avro Hive serde (1 second, 452 milliseconds)
16:20:31.482 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to 
find data source: avro when check data column names.
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro 
is built-in but external data source module since Spark 2.4. Please deploy the 
application as per the deployment section of Apache Avro Data Source Guide.
at 
org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1630)
at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660)
at 
org.apache.spark.sql.execution.command.DDLUtils$.checkDataColNames(ddl.scala:1028)
at 
org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1(ddl.scala:1016)
at 
org.apache.spark.sql.execution.command.DDLUtils$.$anonfun$checkTableColumns$1$adapted(ddl.scala:1004)
at scala.Option.foreach(Option.scala:437)
at 
org.apache.spark.sql.execution.command.DDLUtils$.checkTableColumns(ddl.scala:1004)
00:20:31.485 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.486 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.489 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.490 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.496 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.497 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.500 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.582 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
00:20:31.583 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.587 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.590 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.591 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.594 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.598 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.599 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.602 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored

00:20:31.603 WARN org.apache.hadoop.hive.metastore.ObjectStore: 
datanucleus.autoStartMechanismMode is set to unsupported value null . Setting 
it to value: ignored
[info] - 3.1: read avro file containing decimal (135 milliseconds)
16:20:31.626 ERROR org.apache.spark.sql.execution.command.DDLUtils: Failed to 
find data source: avro when check data column names.
org.apache.spark.sql.AnalysisException: Failed to find data source: avro. Avro 
is built-in but external data source module since Spark 2.4. Please deploy the 
application as per the deployment section of Apache Avro Data Source Guide.
at 

[jira] [Resolved] (SPARK-46672) Upgrade log4j2 to 2.22.1

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46672.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44682
[https://github.com/apache/spark/pull/44682]

> Upgrade log4j2 to 2.22.1
> 
>
> Key: SPARK-46672
> URL: https://issues.apache.org/jira/browse/SPARK-46672
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46672) Upgrade log4j2 to 2.22.1

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46672:
-

Assignee: Yang Jie

> Upgrade log4j2 to 2.22.1
> 
>
> Key: SPARK-46672
> URL: https://issues.apache.org/jira/browse/SPARK-46672
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46675.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44686
[https://github.com/apache/spark/pull/44686]

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46675:
-

Assignee: Cheng Pan

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46641:
--

Assignee: (was: Apache Spark)

> Add maxBytesPerTrigger threshold option
> ---
>
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Maksim Konstantinov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
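
A usage sketch for the option proposed in SPARK-46641 above. "maxBytesPerTrigger" is 
taken from the ticket title and is assumed here to behave like the existing 
"maxFilesPerTrigger" file-source option, i.e. a cap on how much input a single 
micro-batch reads; the exact semantics are not defined until the linked pull request 
lands.

{code:scala}
// Illustrative sketch only: "maxBytesPerTrigger" is the proposed option and is
// assumed to mirror the existing per-batch cap "maxFilesPerTrigger".
import org.apache.spark.sql.SparkSession

object MaxBytesPerTriggerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxBytesPerTrigger-sketch")
      .getOrCreate()

    val events = spark.readStream
      .format("parquet")
      .schema("id LONG, value STRING")      // file sources require an explicit schema
      .option("maxFilesPerTrigger", "100")  // existing threshold: files per micro-batch
      .option("maxBytesPerTrigger", "1g")   // proposed threshold: bytes per micro-batch (assumption)
      .load("/data/events")                 // hypothetical input path

    events.writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
{code}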



[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46641:
--

Assignee: Apache Spark

> Add maxBytesPerTrigger threshold option
> ---
>
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Maksim Konstantinov
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46676:
--

Assignee: (was: Apache Spark)

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46676:
--

Assignee: Apache Spark

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46676:
--

Assignee: (was: Apache Spark)

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46676:
--

Assignee: Apache Spark

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46641:
--

Assignee: (was: Apache Spark)

> Add maxBytesPerTrigger threshold option
> ---
>
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Maksim Konstantinov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46676:
---
Labels: pull-request-available  (was: )

> dropDuplicatesWithinWatermark throws error on canonicalizing plan
> -
>
> Key: SPARK-46676
> URL: https://issues.apache.org/jira/browse/SPARK-46676
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>  Labels: pull-request-available
>
> Simply said, this test code fails:
> {code:java}
> test("SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work") {
>   withTempDir { checkpoint =>
> val dedupeInputData = MemoryStream[(String, Int)]
> val dedupe = dedupeInputData.toDS()
>   .withColumn("eventTime", timestamp_seconds($"_2"))
>   .withWatermark("eventTime", "10 second")
>   .dropDuplicatesWithinWatermark("_1")
>   .select($"_1", $"eventTime".cast("long").as[Long])
> testStream(dedupe, Append)(
>   StartStream(checkpointLocation = checkpoint.getCanonicalPath),
>   AddData(dedupeInputData, "a" -> 1),
>   CheckNewAnswer("a" -> 1),
>   Execute { q =>
> // This threw out error!
> q.lastExecution.executedPlan.canonicalized
>   }
> )
>   }
> } {code}
> with below error:
> {code:java}
> [info] - SPARK-X: canonicalization of 
> StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
> 237 milliseconds)
> [info]   Assert on query failed: Execute: None.get
> [info]   scala.None$.get(Option.scala:627)
> [info]       scala.None$.get(Option.scala:626)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
> [info]       
> org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
> [info]       
> org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
> [info]       
> org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
> [info]       
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46641:
--

Assignee: Apache Spark

> Add maxBytesPerTrigger threshold option
> ---
>
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Maksim Konstantinov
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46665) Remove Pandas dependency for pyspark.testing

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46665:
--

Assignee: Apache Spark

> Remove Pandas dependency for pyspark.testing
> 
>
> Key: SPARK-46665
> URL: https://issues.apache.org/jira/browse/SPARK-46665
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> We should not make pyspark.testing depend on Pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46665) Remove Pandas dependency for pyspark.testing

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46665:
--

Assignee: (was: Apache Spark)

> Remove Pandas dependency for pyspark.testing
> 
>
> Key: SPARK-46665
> URL: https://issues.apache.org/jira/browse/SPARK-46665
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We should not make pyspark.testing depend on Pandas.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46641) Add maxBytesPerTrigger threshold option

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46641:
--

Assignee: (was: Apache Spark)

> Add maxBytesPerTrigger threshold option
> ---
>
> Key: SPARK-46641
> URL: https://issues.apache.org/jira/browse/SPARK-46641
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Maksim Konstantinov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46675:
--

Assignee: Apache Spark

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46660) ReattachExecute requests do not refresh aliveness of SessionHolder

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46660:
--

Assignee: (was: Apache Spark)

> ReattachExecute requests do not refresh aliveness of SessionHolder
> --
>
> Key: SPARK-46660
> URL: https://issues.apache.org/jira/browse/SPARK-46660
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>  Labels: pull-request-available
>
> In the first executePlan request, creating the {{ExecuteHolder}} triggers 
> {{getOrCreateIsolatedSession}}, which refreshes the aliveness of the 
> {{SessionHolder}}. However, in {{ReattachExecute}} we fetch the 
> {{ExecuteHolder}} directly without going through the {{SessionHolder}}, 
> which makes it seem as though the {{SessionHolder}} is idle.
>  
> This would result in long-running queries (which do not send release execute 
> requests since that refreshes aliveness) failing because the 
> {{SessionHolder}} would expire during active query execution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
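
The behaviour described in SPARK-46660 above can be pictured with a small, 
self-contained sketch. It is not the Spark Connect source: apart from 
getOrCreateIsolatedSession, which the description names, every type and method below 
is hypothetical and only illustrates the lookup order the ticket is about.

{code:scala}
// Minimal sketch, not the Spark Connect source. Only getOrCreateIsolatedSession
// is taken from the ticket; the other names are hypothetical.
import java.util.concurrent.ConcurrentHashMap

final class SessionHolder(val sessionId: String) {
  @volatile private var lastAccessMs: Long = System.currentTimeMillis()
  def touch(): Unit = lastAccessMs = System.currentTimeMillis() // refresh aliveness
  def lastAccess: Long = lastAccessMs
}

final class ExecuteHolder(val operationId: String, val session: SessionHolder)

final class SessionManager {
  private val sessions = new ConcurrentHashMap[String, SessionHolder]()
  private val executions = new ConcurrentHashMap[String, ExecuteHolder]()

  // First executePlan request: looking up the session refreshes its aliveness.
  def getOrCreateIsolatedSession(sessionId: String): SessionHolder = {
    val holder = sessions.computeIfAbsent(sessionId, (id: String) => new SessionHolder(id))
    holder.touch()
    holder
  }

  // Shape the ticket describes as problematic: ReattachExecute fetches the
  // ExecuteHolder directly, so the owning SessionHolder looks idle and can
  // expire while the query is still running.
  def reattachExecuteCurrent(operationId: String): ExecuteHolder =
    executions.get(operationId)

  // Sketch of the fix direction: go through the session first, so its
  // aliveness is refreshed before the execution is resumed.
  def reattachExecuteRefreshing(sessionId: String, operationId: String): ExecuteHolder = {
    getOrCreateIsolatedSession(sessionId)
    executions.get(operationId)
  }
}
{code}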



[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46675:
--

Assignee: (was: Apache Spark)

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46660) ReattachExecute requests do not refresh aliveness of SessionHolder

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46660:
--

Assignee: Apache Spark

> ReattachExecute requests do not refresh aliveness of SessionHolder
> --
>
> Key: SPARK-46660
> URL: https://issues.apache.org/jira/browse/SPARK-46660
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> In the first executePlan request, creating the {{ExecuteHolder}} triggers 
> {{getOrCreateIsolatedSession}}, which refreshes the aliveness of the 
> {{SessionHolder}}. However, in {{ReattachExecute}} we fetch the 
> {{ExecuteHolder}} directly without going through the {{SessionHolder}}, 
> which makes it seem as though the {{SessionHolder}} is idle.
>  
> This would result in long-running queries (which do not send release execute 
> requests since that refreshes aliveness) failing because the 
> {{SessionHolder}} would expire during active query execution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46675:
--

Assignee: (was: Apache Spark)

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-46675:
--

Assignee: Apache Spark

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46676) dropDuplicatesWithinWatermark throws error on canonicalizing plan

2024-01-11 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-46676:


 Summary: dropDuplicatesWithinWatermark throws error on 
canonicalizing plan
 Key: SPARK-46676
 URL: https://issues.apache.org/jira/browse/SPARK-46676
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.5.0, 4.0.0
Reporter: Jungtaek Lim


Simply said, this test code fails:
{code:java}
test("SPARK-X: canonicalization of StreamingDeduplicateWithinWatermarkExec 
should work") {
  withTempDir { checkpoint =>
val dedupeInputData = MemoryStream[(String, Int)]
val dedupe = dedupeInputData.toDS()
  .withColumn("eventTime", timestamp_seconds($"_2"))
  .withWatermark("eventTime", "10 second")
  .dropDuplicatesWithinWatermark("_1")
  .select($"_1", $"eventTime".cast("long").as[Long])

testStream(dedupe, Append)(
  StartStream(checkpointLocation = checkpoint.getCanonicalPath),

  AddData(dedupeInputData, "a" -> 1),
  CheckNewAnswer("a" -> 1),

  Execute { q =>
// This threw out error!
q.lastExecution.executedPlan.canonicalized
  }
)
  }
} {code}
with below error:
{code:java}
[info] - SPARK-X: canonicalization of 
StreamingDeduplicateWithinWatermarkExec should work *** FAILED *** (1 second, 
237 milliseconds)
[info]   Assert on query failed: Execute: None.get
[info]   scala.None$.get(Option.scala:627)
[info]       scala.None$.get(Option.scala:626)
[info]       
org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.(statefulOperators.scala:1101)
[info]       
org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.copy(statefulOperators.scala:1092)
[info]       
org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1148)
[info]       
org.apache.spark.sql.execution.streaming.StreamingDeduplicateWithinWatermarkExec.withNewChildInternal(statefulOperators.scala:1087)
[info]       
org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal(TreeNode.scala:1210)
[info]       
org.apache.spark.sql.catalyst.trees.UnaryLike.withNewChildrenInternal$(TreeNode.scala:1208)
[info]       
org.apache.spark.sql.execution.streaming.BaseStreamingDeduplicateExec.withNewChildrenInternal(statefulOperators.scala:949)
[info]       
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:323)
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
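
A Spark-independent sketch of the failure mode suggested by the stack trace above: 
canonicalization copies the operator with some state cleared, and a constructor that 
eagerly calls .get on that Option throws None.get. Whether this is exactly what 
StreamingDeduplicateWithinWatermarkExec does at statefulOperators.scala:1101 is an 
assumption read off the trace, not a statement about the Spark source.

{code:scala}
// Minimal sketch of "None.get thrown from a constructor during canonicalization".
// The class names are made up; only the failure pattern is illustrated.
case class StateInfo(checkpointLocation: String)

// Eager .get in the constructor body: fails as soon as a copy without state is made.
case class EagerOperator(stateInfo: Option[StateInfo]) {
  val checkpoint: String = stateInfo.get.checkpointLocation // throws on None
  def canonicalized: EagerOperator = copy(stateInfo = None)  // NoSuchElementException: None.get
}

// Deferring the dereference keeps canonicalization safe.
case class LazyOperator(stateInfo: Option[StateInfo]) {
  lazy val checkpoint: String = stateInfo.get.checkpointLocation
  def canonicalized: LazyOperator = copy(stateInfo = None)   // fine: lazy val never forced
}

object CanonicalizationSketch extends App {
  println(LazyOperator(Some(StateInfo("/tmp/ckpt"))).canonicalized) // LazyOperator(None)

  try EagerOperator(Some(StateInfo("/tmp/ckpt"))).canonicalized
  catch { case e: NoSuchElementException => println(s"canonicalization failed: $e") }
}
{code}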



[jira] [Resolved] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter

2024-01-11 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif resolved SPARK-46671.
--
Resolution: Not A Bug

> InferFiltersFromConstraint rule is creating a redundant filter
> --
>
> Key: SPARK-46671
> URL: https://issues.apache.org/jira/browse/SPARK-46671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Minor
>  Labels: SQL, catalyst
>
> While bringing my old PR, which uses a different approach to the 
> ConstraintPropagation algorithm 
> ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync 
> with current master, I noticed a test failure in my branch for SPARK-33152:
> The test which is failing is
> InferFiltersFromConstraintSuite:
> {code}
>   test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: 
> Infer Filters") {
> val x = testRelation.as("x")
> val y = testRelation.as("y")
> val z = testRelation.as("z")
> // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze)
> // Once strategy's idempotence is not broken
> val originalQuery =
>   x.join(y, condition = Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z, condition = Some($"xy.a" === $"z.a")).analyze
> val correctAnswer =
>   x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = 
> Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z.where($"a".isNotNull), condition = Some($"xy.a" === 
> $"z.a")).analyze
> val optimizedQuery = InferFiltersFromConstraints(originalQuery)
> comparePlans(optimizedQuery, correctAnswer)
> comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
>   }
> {code}
> In the above test, I believe the below assertion is not proper.
> There is a redundant filter which is getting created.
> Out of these two isNotNull constraints,  only one should be created.
> $"xa".isNotNull && $"x.a".isNotNull 
>  Because the presence of (xa#0 = a#0) automatically implies that if one 
> attribute is not null, the other also has to be not null.
>   // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze) 
> This is not a big issue, but it highlights the need to take another look at 
> the ConstraintPropagation and related code.
> I am filing this jira so that constraint code can be tightened/made more 
> robust.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
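
For readers who want to observe the rule's effect without the Catalyst test helpers 
quoted above, a small sketch using only the public DataFrame API is below. The column 
name is made up, and what InferFiltersFromConstraints actually adds to the optimized 
plan depends on the Spark version, so the printed plans are for inspection rather 
than an assertion of the output.

{code:scala}
// Sketch: build a query shaped like the one in the test (a column equated with
// its own alias) and print the plans to see which IsNotNull filters, if any,
// the optimizer inferred.
import org.apache.spark.sql.SparkSession

object InferredFiltersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inferred-filters-sketch")
      .getOrCreate()
    import spark.implicits._

    // Option values make the column nullable, so null-related constraints matter.
    val df = Seq(Some(1), Some(2), None).toDF("a")
    val q = df.select($"a", $"a".as("xa")).where($"xa" === $"a")

    q.explain(true) // parsed, analyzed, optimized and physical plans

    spark.stop()
  }
}
{code}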



[jira] [Commented] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter

2024-01-11 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805434#comment-17805434
 ] 

Asif commented on SPARK-46671:
--

On further thought, I am wrong. There should be two separate isNotNull 
constraints.

> InferFiltersFromConstraint rule is creating a redundant filter
> --
>
> Key: SPARK-46671
> URL: https://issues.apache.org/jira/browse/SPARK-46671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Minor
>  Labels: SQL, catalyst
>
> While bringing my old PR, which uses a different approach to the 
> ConstraintPropagation algorithm 
> ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync 
> with current master, I noticed a test failure in my branch for SPARK-33152:
> The test which is failing is
> InferFiltersFromConstraintSuite:
> {code}
>   test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: 
> Infer Filters") {
> val x = testRelation.as("x")
> val y = testRelation.as("y")
> val z = testRelation.as("z")
> // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze)
> // Once strategy's idempotence is not broken
> val originalQuery =
>   x.join(y, condition = Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z, condition = Some($"xy.a" === $"z.a")).analyze
> val correctAnswer =
>   x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = 
> Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z.where($"a".isNotNull), condition = Some($"xy.a" === 
> $"z.a")).analyze
> val optimizedQuery = InferFiltersFromConstraints(originalQuery)
> comparePlans(optimizedQuery, correctAnswer)
> comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
>   }
> {code}
> In the above test, I believe the below assertion is not proper.
> There is a redundant filter which is getting created.
> Out of these two isNotNull constraints,  only one should be created.
> $"xa".isNotNull && $"x.a".isNotNull 
>  Because the presence of (xa#0 = a#0) automatically implies that if one 
> attribute is not null, the other also has to be not null.
>   // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze) 
> This is not a big issue, but it highlights the need to take another look at 
> the ConstraintPropagation and related code.
> I am filing this jira so that constraint code can be tightened/made more 
> robust.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46671) InferFiltersFromConstraint rule is creating a redundant filter

2024-01-11 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805435#comment-17805435
 ] 

Asif commented on SPARK-46671:
--

So closing the ticket.

> InferFiltersFromConstraint rule is creating a redundant filter
> --
>
> Key: SPARK-46671
> URL: https://issues.apache.org/jira/browse/SPARK-46671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Minor
>  Labels: SQL, catalyst
>
> While bringing my old PR, which uses a different approach to the 
> ConstraintPropagation algorithm 
> ([SPARK-33152|https://issues.apache.org/jira/browse/SPARK-33152]), in sync 
> with current master, I noticed a test failure in my branch for SPARK-33152:
> The test which is failing is
> InferFiltersFromConstraintSuite:
> {code}
>   test("SPARK-43095: Avoid Once strategy's idempotence is broken for batch: 
> Infer Filters") {
> val x = testRelation.as("x")
> val y = testRelation.as("y")
> val z = testRelation.as("z")
> // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze)
> // Once strategy's idempotence is not broken
> val originalQuery =
>   x.join(y, condition = Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z, condition = Some($"xy.a" === $"z.a")).analyze
> val correctAnswer =
>   x.where($"a".isNotNull).join(y.where($"a".isNotNull), condition = 
> Some($"x.a" === $"y.a"))
> .select($"x.a", $"x.a".as("xa")).as("xy")
> .join(z.where($"a".isNotNull), condition = Some($"xy.a" === 
> $"z.a")).analyze
> val optimizedQuery = InferFiltersFromConstraints(originalQuery)
> comparePlans(optimizedQuery, correctAnswer)
> comparePlans(InferFiltersFromConstraints(optimizedQuery), correctAnswer)
>   }
> {code}
> In the above test, I believe the below assertion is not proper.
> There is a redundant filter which is getting created.
> Out of these two isNotNull constraints,  only one should be created.
> $"xa".isNotNull && $"x.a".isNotNull 
>  Because the presence of (xa#0 = a#0) automatically implies that if one 
> attribute is not null, the other also has to be not null.
>   // Removes EqualNullSafe when constructing candidate constraints
> comparePlans(
>   InferFiltersFromConstraints(x.select($"x.a", $"x.a".as("xa"))
> .where($"xa" <=> $"x.a" && $"xa" === $"x.a").analyze),
>   x.select($"x.a", $"x.a".as("xa"))
> .where($"xa".isNotNull && $"x.a".isNotNull && $"xa" <=> $"x.a" && 
> $"xa" === $"x.a").analyze) 
> This is not a big issue, but it highlights the need to take another look at 
> the ConstraintPropagation and related code.
> I am filing this jira so that constraint code can be tightened/made more 
> robust.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46675:
---
Labels: pull-request-available  (was: )

> Remove unused inferTimestampNTZ in ParquetReadSupport
> -
>
> Key: SPARK-46675
> URL: https://issues.apache.org/jira/browse/SPARK-46675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46675) Remove unused inferTimestampNTZ in ParquetReadSupport

2024-01-11 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-46675:
-

 Summary: Remove unused inferTimestampNTZ in ParquetReadSupport
 Key: SPARK-46675
 URL: https://issues.apache.org/jira/browse/SPARK-46675
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46668) Parallelize Sphinx build of Python API docs

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46668:


Assignee: Nicholas Chammas

> Parallelize Sphinx build of Python API docs
> ---
>
> Key: SPARK-46668
> URL: https://issues.apache.org/jira/browse/SPARK-46668
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46668) Parallelize Sphinx build of Python API docs

2024-01-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46668.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44680
[https://github.com/apache/spark/pull/44680]

> Parallelize Sphinx build of Python API docs
> ---
>
> Key: SPARK-46668
> URL: https://issues.apache.org/jira/browse/SPARK-46668
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org