[jira] [Commented] (SPARK-45002) Avoid uncaught exception from state store maintenance task thread on error
[ https://issues.apache.org/jira/browse/SPARK-45002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759829#comment-17759829 ] Anish Shrigondekar commented on SPARK-45002: PR here: [https://github.com/apache/spark/pull/42716] cc - [~kabhwan] > Avoid uncaught exception from state store maintenance task thread on error > -- > > Key: SPARK-45002 > URL: https://issues.apache.org/jira/browse/SPARK-45002 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.5.1 >Reporter: Anish Shrigondekar >Priority: Major > > Avoid uncaught exception from state store maintenance task thread on error -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
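The linked PR contains the actual fix. As a generic illustration only (not Spark's code, which is Scala), the idea of shielding a background maintenance task from uncaught exceptions can be sketched in Python; the function and thread names here are hypothetical:

```python
import logging
import threading

logger = logging.getLogger("maintenance")

def start_maintenance_thread(task):
    """Run `task` on a daemon thread, catching any error it raises so the
    failure is logged instead of surfacing as an uncaught exception that
    would invoke the thread's default uncaught-exception handler."""
    def safe_run():
        try:
            task()
        except Exception:
            logger.exception("Maintenance task failed; error suppressed")

    t = threading.Thread(target=safe_run, name="maintenance-task", daemon=True)
    t.start()
    return t
```

Without the try/except, an error raised by `task` would escape the thread body and be reported as an uncaught exception rather than handled by the application.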
[jira] [Created] (SPARK-45003) Refine docstring of `asc/desc`
Yang Jie created SPARK-45003: Summary: Refine docstring of `asc/desc` Key: SPARK-45003 URL: https://issues.apache.org/jira/browse/SPARK-45003 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Created] (SPARK-45002) Avoid uncaught exception from state store maintenance task thread on error
Anish Shrigondekar created SPARK-45002: -- Summary: Avoid uncaught exception from state store maintenance task thread on error Key: SPARK-45002 URL: https://issues.apache.org/jira/browse/SPARK-45002 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 3.5.1 Reporter: Anish Shrigondekar Avoid uncaught exception from state store maintenance task thread on error
[jira] [Resolved] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
[ https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44996. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42710 [https://github.com/apache/spark/pull/42710] > VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed > - > > Key: SPARK-44996 > URL: https://issues.apache.org/jira/browse/SPARK-44996 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 4.0.0 > > > Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit > test suite `VolcanoFeatureStepSuite` behaves like an integration test. In > other words, it fails when there is no backend K8s clusters. > {code} > $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z > SPARK-36061" > ... > [info] VolcanoFeatureStepSuite: > [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 > milliseconds) > [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values > are not allowed here > [info] in reader, line 1, column 94: > [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. > ... > [info] ^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
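The fix described above amounts to lazy initialization: construct the expensive client only when a code path actually needs it, so unit tests that never touch it do not require a live cluster. A minimal, hypothetical Python sketch of that pattern (the real feature step is Scala and uses `DefaultVolcanoClient`; the class and factory names here are invented for illustration):

```python
class LazyClientStep:
    """Defer constructing an expensive client until first use, so code
    paths (and unit tests) that never need it require no live backend."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None  # not created yet

    @property
    def client(self):
        if self._client is None:            # create on first access only
            self._client = self._client_factory()
        return self._client
```

Constructing `LazyClientStep` is now free; the factory only runs if `step.client` is ever read, and the result is cached for subsequent accesses.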
[jira] [Assigned] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
[ https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44996: - Assignee: Dongjoon Hyun > VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed > - > > Key: SPARK-44996 > URL: https://issues.apache.org/jira/browse/SPARK-44996 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit > test suite `VolcanoFeatureStepSuite` behaves like an integration test. In > other words, it fails when there is no backend K8s clusters. > {code} > $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z > SPARK-36061" > ... > [info] VolcanoFeatureStepSuite: > [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 > milliseconds) > [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values > are not allowed here > [info] in reader, line 1, column 94: > [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. > ... > [info] ^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44999) Refactor ExternalSorter to reduce checks on shouldPartition when calling getPartition
[ https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44999: - Summary: Refactor ExternalSorter to reduce checks on shouldPartition when calling getPartition (was: Refactor ExternalSorter#getPartition to reduce checks on shouldPartition) > Refactor ExternalSorter to reduce checks on shouldPartition when calling > getPartition > - > > Key: SPARK-44999 > URL: https://issues.apache.org/jira/browse/SPARK-44999 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > private def getPartition(key: K): Int = { > if (shouldPartition) partitioner.get.getPartition(key) else 0 > } {code} > > The {{getPartition}} method checks {{shouldPartition}} every time it is > called. However, {{shouldPartition}} should not be able to change after the > {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce > the checks on {{{}shouldPartition{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
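The refactor described in SPARK-44999 hoists the `shouldPartition` check out of the per-record path: since the flag cannot change after construction, the partition function can be chosen once up front. A hypothetical Python sketch of the same idea (the real code is the Scala shown in the issue; names here are illustrative):

```python
class Sorter:
    """Choose the partitioning function once, at construction, instead of
    re-checking a should_partition flag on every get_partition call."""

    def __init__(self, partitioner=None):
        if partitioner is not None:
            # Partitioning enabled: delegate directly to the partitioner.
            self._get_partition = partitioner
        else:
            # No partitioner: every key maps to partition 0.
            self._get_partition = lambda key: 0

    def get_partition(self, key):
        # No branch here: the decision was made once in __init__.
        return self._get_partition(key)
```

The per-call branch disappears; `get_partition` becomes a plain delegation to whichever function was bound at construction time.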
[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing
[ https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759821#comment-17759821 ] Yauheni Audzeichyk commented on SPARK-44900: [~yxzhang] it looks like it is just a disk usage tracking issue, since disk space is not actually used that much. However, it hurts the effectiveness of the cached data: because Spark believes the data no longer fits in memory, it spills it to disk, so eventually the cache becomes 100% stored on disk. > Cached DataFrame keeps growing > -- > > Key: SPARK-44900 > URL: https://issues.apache.org/jira/browse/SPARK-44900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Varun Nalla >Priority: Blocker > > Scenario : > We have a kafka streaming application where the data lookups are happening by > joining another DF which is cached, and the caching strategy is > MEMORY_AND_DISK. > However the size of the cached DataFrame keeps on growing for every micro > batch the streaming application process and that's being visible under > storage tab. > A similar stack overflow thread was already raised. > https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing
[jira] [Updated] (SPARK-44999) Refactor ExternalSorter#getPartition to reduce checks on shouldPartition
[ https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44999: - Summary: Refactor ExternalSorter#getPartition to reduce checks on shouldPartition (was: Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments) > Refactor ExternalSorter#getPartition to reduce checks on shouldPartition > > > Key: SPARK-44999 > URL: https://issues.apache.org/jira/browse/SPARK-44999 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > private def getPartition(key: K): Int = { > if (shouldPartition) partitioner.get.getPartition(key) else 0 > } {code} > > The {{getPartition}} method checks {{shouldPartition}} every time it is > called. However, {{shouldPartition}} should not be able to change after the > {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce > the checks on {{{}shouldPartition{}}}.
[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing
[ https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759818#comment-17759818 ] Yuexin Zhang commented on SPARK-44900: -- Hi [~varun2807] [~yaud] did you check the actual cached file size on disk, on the yarn node manager local filesystem? Is it really ever growing? > Cached DataFrame keeps growing > -- > > Key: SPARK-44900 > URL: https://issues.apache.org/jira/browse/SPARK-44900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Varun Nalla >Priority: Blocker > > Scenario : > We have a kafka streaming application where the data lookups are happening by > joining another DF which is cached, and the caching strategy is > MEMORY_AND_DISK. > However the size of the cached DataFrame keeps on growing for every micro > batch the streaming application process and that's being visible under > storage tab. > A similar stack overflow thread was already raised. > https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45001) Implement DataFrame.foreachPartition
Hyukjin Kwon created SPARK-45001: Summary: Implement DataFrame.foreachPartition Key: SPARK-45001 URL: https://issues.apache.org/jira/browse/SPARK-45001 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon
[jira] [Commented] (SPARK-41279) Feature parity: DataFrame API in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-41279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759803#comment-17759803 ] Hyukjin Kwon commented on SPARK-41279: -- See also https://github.com/apache/spark/pull/42714 > Feature parity: DataFrame API in Spark Connect > -- > > Key: SPARK-41279 > URL: https://issues.apache.org/jira/browse/SPARK-41279 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Critical > > Implement DataFrame API in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45000) Implement DataFrame.foreach
[ https://issues.apache.org/jira/browse/SPARK-45000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759802#comment-17759802 ] Snoot.io commented on SPARK-45000: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/42714 > Implement DataFrame.foreach > --- > > Key: SPARK-45000 > URL: https://issues.apache.org/jira/browse/SPARK-45000 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45000) Implement DataFrame.foreach
Hyukjin Kwon created SPARK-45000: Summary: Implement DataFrame.foreach Key: SPARK-45000 URL: https://issues.apache.org/jira/browse/SPARK-45000 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Hyukjin Kwon
[jira] [Updated] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments
[ https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44999: - Description: {code:java} private def getPartition(key: K): Int = { if (shouldPartition) partitioner.get.getPartition(key) else 0 } {code} The {{getPartition}} method checks {{shouldPartition}} every time it is called. However, {{shouldPartition}} should not be able to change after the {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce the checks on {{{}shouldPartition{}}}. was: {code:java} private def getPartition(key: K): Int = { if (shouldPartition) partitioner.get.getPartition(key) else 0 } {code} > Refactor `ExternalSorter#getPartition` to reduce the number of `if else` > judgments > -- > > Key: SPARK-44999 > URL: https://issues.apache.org/jira/browse/SPARK-44999 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > private def getPartition(key: K): Int = { > if (shouldPartition) partitioner.get.getPartition(key) else 0 > } {code} > > The {{getPartition}} method checks {{shouldPartition}} every time it is > called. However, {{shouldPartition}} should not be able to change after the > {{ExternalSorter}} is instantiated. Therefore, it can be refactored to reduce > the checks on {{{}shouldPartition{}}}.
[jira] [Commented] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments
[ https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759799#comment-17759799 ] Snoot.io commented on SPARK-44999: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42713 > Refactor `ExternalSorter#getPartition` to reduce the number of `if else` > judgments > -- > > Key: SPARK-44999 > URL: https://issues.apache.org/jira/browse/SPARK-44999 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > private def getPartition(key: K): Int = { > if (shouldPartition) partitioner.get.getPartition(key) else 0 > } {code} >
[jira] [Commented] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments
[ https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759798#comment-17759798 ] Snoot.io commented on SPARK-44999: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/42713 > Refactor `ExternalSorter#getPartition` to reduce the number of `if else` > judgments > -- > > Key: SPARK-44999 > URL: https://issues.apache.org/jira/browse/SPARK-44999 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > private def getPartition(key: K): Int = { > if (shouldPartition) partitioner.get.getPartition(key) else 0 > } {code} >
[jira] [Updated] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments
[ https://issues.apache.org/jira/browse/SPARK-44999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44999: - Description: {code:java} private def getPartition(key: K): Int = { if (shouldPartition) partitioner.get.getPartition(key) else 0 } {code} > Refactor `ExternalSorter#getPartition` to reduce the number of `if else` > judgments > -- > > Key: SPARK-44999 > URL: https://issues.apache.org/jira/browse/SPARK-44999 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > > {code:java} > private def getPartition(key: K): Int = { > if (shouldPartition) partitioner.get.getPartition(key) else 0 > } {code} >
[jira] [Created] (SPARK-44999) Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments
Yang Jie created SPARK-44999: Summary: Refactor `ExternalSorter#getPartition` to reduce the number of `if else` judgments Key: SPARK-44999 URL: https://issues.apache.org/jira/browse/SPARK-44999 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie
[jira] [Commented] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
[ https://issues.apache.org/jira/browse/SPARK-44997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759794#comment-17759794 ] Snoot.io commented on SPARK-44997: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42712 > Align example order (Python -> Scala/Java -> R) in all Spark Doc Content > > > Key: SPARK-44997 > URL: https://issues.apache.org/jira/browse/SPARK-44997 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41279) Feature parity: DataFrame API in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-41279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759789#comment-17759789 ] Hyukjin Kwon commented on SPARK-41279: -- You can run {code} def wrapped(itr): for pandas_df in itr: yield pandas_df.applymap(your_func) df.mapInPandas(wrapped, schema=...) {code} > Feature parity: DataFrame API in Spark Connect > -- > > Key: SPARK-41279 > URL: https://issues.apache.org/jira/browse/SPARK-41279 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Critical > > Implement DataFrame API in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43646) Make `connect` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-43646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-43646: Assignee: Yang Jie > Make `connect` module daily test pass > - > > Key: SPARK-43646 > URL: https://issues.apache.org/jira/browse/SPARK-43646 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > run > {code:java} > build/mvn clean install -DskipTests > build/mvn test -pl connector/connect/server {code} > {code:java} > - from_protobuf_messageClassName *** FAILED *** > org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could > not load Protobuf class with name > org.apache.spark.connect.proto.StorageLevel. > org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf > Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with > Protobuf classes needs to be shaded (com.google.protobuf.* --> > org.sparkproject.spark_protobuf.protobuf.*). 
> at > org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > - from_protobuf_messageClassName_options *** FAILED *** > org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could > not load Protobuf class with name > org.apache.spark.connect.proto.StorageLevel. > org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf > Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with > Protobuf classes needs to be shaded (com.google.protobuf.* --> > org.sparkproject.spark_protobuf.protobuf.*). 
> at > org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43646) Make `connect` module daily test pass
[ https://issues.apache.org/jira/browse/SPARK-43646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-43646. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42236 [https://github.com/apache/spark/pull/42236] > Make `connect` module daily test pass > - > > Key: SPARK-43646 > URL: https://issues.apache.org/jira/browse/SPARK-43646 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > run > {code:java} > build/mvn clean install -DskipTests > build/mvn test -pl connector/connect/server {code} > {code:java} > - from_protobuf_messageClassName *** FAILED *** > org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could > not load Protobuf class with name > org.apache.spark.connect.proto.StorageLevel. > org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf > Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with > Protobuf classes needs to be shaded (com.google.protobuf.* --> > org.sparkproject.spark_protobuf.protobuf.*). 
> at > org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > - from_protobuf_messageClassName_options *** FAILED *** > org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS] Could > not load Protobuf class with name > org.apache.spark.connect.proto.StorageLevel. > org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf > Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar with > Protobuf classes needs to be shaded (com.google.protobuf.* --> > org.sparkproject.spark_protobuf.protobuf.*). 
> at > org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3417) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:193) > at > org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:151) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43) > at > org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42) > at > org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs
[ https://issues.apache.org/jira/browse/SPARK-44998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhen Wang updated SPARK-44998: -- Description: I found a lot of retry parsing inprogress event log records in history server log. The application is already done while parsing, so we don't need to retry parsing it again when FileNotFoundException occurs. !image-2023-08-29-10-47-08-027.png! !image-2023-08-29-10-47-43-567.png! was: I found a lot of retry parsing inprogress event log records in history server log. The application is already done while parsing, so we don't need to retry parsing it again when FileNotFoundException occurs. !image-2023-08-29-10-43-21-991.png! !image-2023-08-29-10-44-34-375.png! > No need to retry parsing event log path again when FileNotFoundException > occurs > --- > > Key: SPARK-44998 > URL: https://issues.apache.org/jira/browse/SPARK-44998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Zhen Wang >Priority: Minor > Attachments: image-2023-08-29-10-47-08-027.png, > image-2023-08-29-10-47-43-567.png > > > I found a lot of retry parsing inprogress event log records in history server > log. The application is already done while parsing, so we don't need to retry > parsing it again when FileNotFoundException occurs. > > !image-2023-08-29-10-47-08-027.png! > !image-2023-08-29-10-47-43-567.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
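The improvement above distinguishes terminal from transient failures: a `FileNotFoundException` means the in-progress event log is gone (the application finished), so retrying the parse cannot succeed. A hypothetical Python sketch of that retry policy (not the actual history server code; function names are invented):

```python
def parse_event_log(path, read_fn, max_attempts=3):
    """Retry transient read errors, but give up immediately when the file
    no longer exists: the application finished and its log was removed."""
    for _ in range(max_attempts):
        try:
            return read_fn(path)
        except FileNotFoundError:
            return None   # terminal: the log is gone, retrying cannot succeed
        except OSError:
            continue      # transient I/O error: retry
    return None
```

Note the ordering of the handlers: `FileNotFoundError` is a subclass of `OSError` in Python, so it must be caught first or it would be treated as transient and retried.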
[jira] [Updated] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs
[ https://issues.apache.org/jira/browse/SPARK-44998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhen Wang updated SPARK-44998: -- Attachment: image-2023-08-29-10-47-08-027.png > No need to retry parsing event log path again when FileNotFoundException > occurs > --- > > Key: SPARK-44998 > URL: https://issues.apache.org/jira/browse/SPARK-44998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Zhen Wang >Priority: Minor > Attachments: image-2023-08-29-10-47-08-027.png, > image-2023-08-29-10-47-43-567.png > > > I found a lot of retry parsing inprogress event log records in history server > log. The application is already done while parsing, so we don't need to retry > parsing it again when FileNotFoundException occurs. > > !image-2023-08-29-10-43-21-991.png! > !image-2023-08-29-10-44-34-375.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs
[ https://issues.apache.org/jira/browse/SPARK-44998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhen Wang updated SPARK-44998: -- Attachment: image-2023-08-29-10-47-43-567.png > No need to retry parsing event log path again when FileNotFoundException > occurs > --- > > Key: SPARK-44998 > URL: https://issues.apache.org/jira/browse/SPARK-44998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: Zhen Wang >Priority: Minor > Attachments: image-2023-08-29-10-47-08-027.png, > image-2023-08-29-10-47-43-567.png > > > I found a lot of retry parsing inprogress event log records in history server > log. The application is already done while parsing, so we don't need to retry > parsing it again when FileNotFoundException occurs. > > !image-2023-08-29-10-43-21-991.png! > !image-2023-08-29-10-44-34-375.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44998) No need to retry parsing event log path again when FileNotFoundException occurs
Zhen Wang created SPARK-44998: - Summary: No need to retry parsing event log path again when FileNotFoundException occurs Key: SPARK-44998 URL: https://issues.apache.org/jira/browse/SPARK-44998 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.1 Reporter: Zhen Wang I found many retries of parsing in-progress event log records in the history server log. The application is already finished when parsing happens, so we don't need to retry parsing it when FileNotFoundException occurs. !image-2023-08-29-10-43-21-991.png! !image-2023-08-29-10-44-34-375.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44860) Implement SESSION_USER function
[ https://issues.apache.org/jira/browse/SPARK-44860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44860. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42549 [https://github.com/apache/spark/pull/42549] > Implement SESSION_USER function > --- > > Key: SPARK-44860 > URL: https://issues.apache.org/jira/browse/SPARK-44860 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Vitalii Li >Assignee: Vitalii Li >Priority: Major > Fix For: 4.0.0 > > > According to SQL standard SESSION_USER and CURRENT_USER behavior differs for > routines: > - CURRENT_USER inside a routine should return security definer of a routine, > e.g. owner identity > - SESSION_USER inside a routine should return connected user. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
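The distinction the SPARK-44860 description draws can be modeled in a few lines. This is a toy sketch of the standard's rule, not Spark's implementation, and the dictionary shape and function names are hypothetical: inside a routine, CURRENT_USER resolves to the routine's definer (owner), while SESSION_USER remains the connected user.

```python
def run_routine(routine, session_user):
    # Hypothetical model: `routine` is a dict with a "definer" (owner identity)
    # and a "body" callable that receives the resolved user context.
    ctx = {
        "CURRENT_USER": routine["definer"],   # security definer of the routine
        "SESSION_USER": session_user,         # the connected user, unchanged
    }
    return routine["body"](ctx)

routine = {
    "definer": "routine_owner",
    "body": lambda ctx: (ctx["CURRENT_USER"], ctx["SESSION_USER"]),
}
# Connected as "alice": CURRENT_USER is the owner, SESSION_USER is alice.
print(run_routine(routine, "alice"))  # → ('routine_owner', 'alice')
```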
[jira] [Assigned] (SPARK-44965) Hide internal functions/variables from `pyspark.sql.functions`
[ https://issues.apache.org/jira/browse/SPARK-44965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44965: - Assignee: Ruifeng Zheng > Hide internal functions/variables from `pyspark.sql.functions` > -- > > Key: SPARK-44965 > URL: https://issues.apache.org/jira/browse/SPARK-44965 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44965) Hide internal functions/variables from `pyspark.sql.functions`
[ https://issues.apache.org/jira/browse/SPARK-44965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44965. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42680 [https://github.com/apache/spark/pull/42680] > Hide internal functions/variables from `pyspark.sql.functions` > -- > > Key: SPARK-44965 > URL: https://issues.apache.org/jira/browse/SPARK-44965 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44997) Align example order (Python -> Scala/Java -> R) in all Spark Doc Content
BingKun Pan created SPARK-44997: --- Summary: Align example order (Python -> Scala/Java -> R) in all Spark Doc Content Key: SPARK-44997 URL: https://issues.apache.org/jira/browse/SPARK-44997 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44995) Promote SparkKubernetesClientFactory to DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-44995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44995: - Assignee: Dongjoon Hyun > Promote SparkKubernetesClientFactory to DeveloperApi > > > Key: SPARK-44995 > URL: https://issues.apache.org/jira/browse/SPARK-44995 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44995) Promote SparkKubernetesClientFactory to DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-44995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44995. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42709 [https://github.com/apache/spark/pull/42709] > Promote SparkKubernetesClientFactory to DeveloperApi > > > Key: SPARK-44995 > URL: https://issues.apache.org/jira/browse/SPARK-44995 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44993) Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums
[ https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44993: - Assignee: Dongjoon Hyun > Add ShuffleChecksumUtils.compareChecksums by reusing > ShuffleChecksumTestHelp.compareChecksums > - > > Key: SPARK-44993 > URL: https://issues.apache.org/jira/browse/SPARK-44993 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44993) Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums
[ https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44993. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42707 [https://github.com/apache/spark/pull/42707] > Add ShuffleChecksumUtils.compareChecksums by reusing > ShuffleChecksumTestHelp.compareChecksums > - > > Key: SPARK-44993 > URL: https://issues.apache.org/jira/browse/SPARK-44993 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
[ https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44996: -- Description: Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit test suite `VolcanoFeatureStepSuite` behaves like an integration test. In other words, it fails when there is no backend K8s clusters. {code} $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z SPARK-36061" ... [info] VolcanoFeatureStepSuite: [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 milliseconds) [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values are not allowed here [info] in reader, line 1, column 94: [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. ... [info] ^ {code} was: Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit test suite `VolcanoFeatureStepSuite` behaves like an integration test. In other words, it fails when there is no network connectivity. {code} $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z SPARK-36061" ... [info] VolcanoFeatureStepSuite: [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 milliseconds) [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values are not allowed here [info] in reader, line 1, column 94: [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. ... [info] ^ {code} > VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed > - > > Key: SPARK-44996 > URL: https://issues.apache.org/jira/browse/SPARK-44996 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit > test suite `VolcanoFeatureStepSuite` behaves like an integration test. In > other words, it fails when there is no backend K8s clusters. 
> {code} > $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z > SPARK-36061" > ... > [info] VolcanoFeatureStepSuite: > [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 > milliseconds) > [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values > are not allowed here > [info] in reader, line 1, column 94: > [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. > ... > [info] ^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
[ https://issues.apache.org/jira/browse/SPARK-44996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44996: -- Description: Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit test suite `VolcanoFeatureStepSuite` behaves like an integration test. In other words, it fails when there is no network connectivity. {code} $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z SPARK-36061" ... [info] VolcanoFeatureStepSuite: [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 milliseconds) [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values are not allowed here [info] in reader, line 1, column 94: [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. ... [info] ^ {code} was: Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit test suite `VolcanoFeatureStepSuite` behaves like an integration test. In other words, it fails when there is no network connectivity. {code} [info] VolcanoFeatureStepSuite: [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 milliseconds) [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values are not allowed here [info] in reader, line 1, column 94: [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. ... [info] ^ {code} > VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed > - > > Key: SPARK-44996 > URL: https://issues.apache.org/jira/browse/SPARK-44996 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit > test suite `VolcanoFeatureStepSuite` behaves like an integration test. In > other words, it fails when there is no network connectivity. 
> {code} > $ build/sbt -Pkubernetes -Pvolcano "kubernetes/testOnly *Volcano* -- -z > SPARK-36061" > ... > [info] VolcanoFeatureStepSuite: > [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 > milliseconds) > [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values > are not allowed here > [info] in reader, line 1, column 94: > [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. > ... > [info] ^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44996) VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed
Dongjoon Hyun created SPARK-44996: - Summary: VolcanoFeatureStep should not create `DefaultVolcanoClient` if not needed Key: SPARK-44996 URL: https://issues.apache.org/jira/browse/SPARK-44996 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 4.0.0 Reporter: Dongjoon Hyun Since `VolcanoFeatureStep` creates `DefaultVolcanoClient` always, the unit test suite `VolcanoFeatureStepSuite` behaves like an integration test. In other words, it fails when there is no network connectivity. {code} [info] VolcanoFeatureStepSuite: [info] - SPARK-36061: Driver Pod with Volcano PodGroup *** FAILED *** (646 milliseconds) [info] org.snakeyaml.engine.v2.exceptions.ScannerException: mapping values are not allowed here [info] in reader, line 1, column 94: [info] ... well-known/openid-configuration": dial tcp: lookup iam.corp. ... [info] ^ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
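The direction suggested by the SPARK-44996 title — don't create the client unless it is needed — is the classic lazy-initialization pattern. A minimal Python sketch (class and method names are hypothetical, not Spark's actual code) shows why it fixes the test-suite symptom: paths that never touch the client no longer require a reachable K8s backend.

```python
class LazyFeatureStep:
    """Sketch of deferring an expensive client until first use."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None  # nothing is created at construction time

    @property
    def client(self):
        # The client is built on first access only; unit tests that never
        # reach this line do not need network connectivity.
        if self._client is None:
            self._client = self._client_factory()
        return self._client

    def configure_pod(self, pod):
        # Pure-logic path exercised by unit tests: no client involved.
        return dict(pod, schedulerName="volcano")
```

With an eager design, the factory would run inside `__init__` and a unit test constructing the step would fail without a cluster; with the lazy property, `configure_pod` works in isolation.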
[jira] [Created] (SPARK-44995) Promote SparkKubernetesClientFactory to DeveloperApi
Dongjoon Hyun created SPARK-44995: - Summary: Promote SparkKubernetesClientFactory to DeveloperApi Key: SPARK-44995 URL: https://issues.apache.org/jira/browse/SPARK-44995 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44994) Refine docstring of `DataFrame.filter`
[ https://issues.apache.org/jira/browse/SPARK-44994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44994: - Summary: Refine docstring of `DataFrame.filter` (was: Refine docstring for `DataFrame.filter`) > Refine docstring of `DataFrame.filter` > -- > > Key: SPARK-44994 > URL: https://issues.apache.org/jira/browse/SPARK-44994 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Refine the docstring and add more examples for DataFrame.filter -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44993) Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums
[ https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44993: -- Summary: Add ShuffleChecksumUtils.compareChecksums by reusing ShuffleChecksumTestHelp.compareChecksums (was: Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils) > Add ShuffleChecksumUtils.compareChecksums by reusing > ShuffleChecksumTestHelp.compareChecksums > - > > Key: SPARK-44993 > URL: https://issues.apache.org/jira/browse/SPARK-44993 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44994) Refine the docstring of `DataFrame.filter`
Allison Wang created SPARK-44994: Summary: Refine the docstring of `DataFrame.filter` Key: SPARK-44994 URL: https://issues.apache.org/jira/browse/SPARK-44994 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Refine the docstring and add more examples for DataFrame.filter -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44994) Refine docstring for `DataFrame.filter`
[ https://issues.apache.org/jira/browse/SPARK-44994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-44994: - Summary: Refine docstring for `DataFrame.filter` (was: Refine the docstring of `DataFrame.filter`) > Refine docstring for `DataFrame.filter` > --- > > Key: SPARK-44994 > URL: https://issues.apache.org/jira/browse/SPARK-44994 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Refine the docstring and add more examples for DataFrame.filter -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44993) Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils
[ https://issues.apache.org/jira/browse/SPARK-44993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44993: -- Summary: Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils (was: Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils and move compareChecksums) > Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils > --- > > Key: SPARK-44993 > URL: https://issues.apache.org/jira/browse/SPARK-44993 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44993) Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils and move compareChecksums
Dongjoon Hyun created SPARK-44993: - Summary: Move compareChecksums from ShuffleChecksumTestHelpe to ShuffleChecksumUtils and move compareChecksums Key: SPARK-44993 URL: https://issues.apache.org/jira/browse/SPARK-44993 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41279) Feature parity: DataFrame API in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-41279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759732#comment-17759732 ] Johannes Alberti commented on SPARK-41279: -- [~gurwls223] thank you for your response. When using `mapInPandas`, is `spark.sql.execution.arrow.maxRecordsPerBatch` always the rowset boundary? Just for my understanding: in `foreach` we have an invocation of `func` per row, and in `foreachBatch` we have an invocation of `func` per partition. When moving to `mapInPandas` (not that this would make much sense, but just for illustration purposes) ... I would need to run with `maxRecordsPerBatch=1` to have the same behavior as `foreach`, ... but otherwise, when running with the default of `maxRecordsPerBatch=10_000`, likely with better performance than in the past (execution is still distributed in the cluster), ... I will have more invocations of `func` than the partition count if my partitions are larger than `10_000` rows each. Is that understanding correct? Thanks again! > Feature parity: DataFrame API in Spark Connect > -- > > Key: SPARK-41279 > URL: https://issues.apache.org/jira/browse/SPARK-41279 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Critical > > Implement DataFrame API in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
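The batching arithmetic in the comment above can be checked with a small model. This is plain Python, not Spark itself: assuming each partition is split into pandas batches of at most `maxRecordsPerBatch` rows, `func` runs ceil(n / b) times for a partition of n rows, so b = 1 reproduces per-row invocation and a b larger than every partition gives one invocation per partition.

```python
from math import ceil

def invocations_per_partition(partition_sizes, max_records_per_batch):
    """Model how many times mapInPandas-style code would call `func`:
    each partition is chopped into batches of at most
    max_records_per_batch rows, one call per batch."""
    return [ceil(n / max_records_per_batch) for n in partition_sizes]

# batch size 1 behaves like foreach: one call per row
print(invocations_per_partition([5, 3], 1))          # → [5, 3]
# default-sized batches on large partitions: more calls than partitions
print(invocations_per_partition([25_000, 3_000], 10_000))  # → [3, 1]
```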
[jira] [Created] (SPARK-44992) Add support for rack information from an environment variable
Holden Karau created SPARK-44992: Summary: Add support for rack information from an environment variable Key: SPARK-44992 URL: https://issues.apache.org/jira/browse/SPARK-44992 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: Holden Karau This would allow us to use things like EC2_AVAILABILITY_ZONE for locality for Kube (or other clusters) which span multiple AZs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
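The feature described in SPARK-44992 amounts to reading topology information from the environment instead of a topology script. A sketch of the resolution logic (the variable name `SPARK_RACK` is an illustrative assumption, not a real Spark configuration):

```python
import os

def resolve_rack(env=None, var="SPARK_RACK", default=None):
    """Resolve a rack id from an environment variable, e.g. one that a
    deployment exports from EC2_AVAILABILITY_ZONE on multi-AZ clusters.
    Blank or missing values fall back to the supplied default."""
    env = os.environ if env is None else env
    value = env.get(var, "").strip()
    return value if value else default
```

Treating blank values the same as missing ones matters here: an empty rack string would otherwise silently create a bogus "empty" locality group.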
[jira] [Commented] (SPARK-44900) Cached DataFrame keeps growing
[ https://issues.apache.org/jira/browse/SPARK-44900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759703#comment-17759703 ] Varun Nalla commented on SPARK-44900: - [~yao] hope you got a chance to look into what [~yaud] mentioned. > Cached DataFrame keeps growing > -- > > Key: SPARK-44900 > URL: https://issues.apache.org/jira/browse/SPARK-44900 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Varun Nalla >Priority: Blocker > > Scenario : > We have a kafka streaming application where the data lookups are happening by > joining another DF which is cached, and the caching strategy is > MEMORY_AND_DISK. > However the size of the cached DataFrame keeps on growing for every micro > batch the streaming application process and that's being visible under > storage tab. > A similar stack overflow thread was already raised. > https://stackoverflow.com/questions/55601779/spark-dataframe-cache-keeps-growing -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44991) Spark json schema inference and fromJson api having inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nirav patel updated SPARK-44991: Summary: Spark json schema inference and fromJson api having inconsistent behavior (was: Spark json datasource reader and fromJson api having inconsistent behavior) > Spark json schema inference and fromJson api having inconsistent behavior > - > > Key: SPARK-44991 > URL: https://issues.apache.org/jira/browse/SPARK-44991 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: nirav patel >Priority: Major > > Spark's JSON reader can infer the datatype of a field. I am ingesting millions of > datapoints and generating a `DataFrameA`. What I notice is that schema > inference marks the datatype of a field containing tons of integers and empty strings > as a Long. That is okay behavior, since I don't set `primitivesAsString` because I > do want primitive type inference. I store `DataFrameA` into `TableA`. > Now, this inference behavior is not respected by the `fromJson`/`from_json` > API when I am trying to write new data to `TableA`. That means, if I read a chunk > of input data using > `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the > reader complains that an empty string cannot be cast to Long. > `getStruct(TableA)` is a pseudo method that returns the `struct` of TableA's schema > somehow, and `/path/to/more/data` has some values for this field as empty > strings. > I think if the reader doesn't complain about empty strings during schema inference, > it shouldn't complain when reading without inference either. Maybe treat empty > as null just like during schema inference, or at least provide an additional > option, treatEmptyAsNull, so it's more explicit for application users? > PS: I marked this as a bug, but it could be better suited as an improvement. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44991) Spark json datasource reader and fromJson api having inconsistent behavior
[ https://issues.apache.org/jira/browse/SPARK-44991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nirav patel updated SPARK-44991: Description: Spark's JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing tons of integers and empty strings as a Long. That is okay behavior, since I don't set `primitivesAsString` because I do want primitive type inference. I store `DataFrameA` into `TableA`. Now, this inference behavior is not respected by the `fromJson`/`from_json` API when I am trying to write new data to `TableA`. That means, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` is a pseudo method that returns the `struct` of TableA's schema somehow, and `/path/to/more/data` has some values for this field as empty strings. I think if the reader doesn't complain about empty strings during schema inference, it shouldn't complain when reading without inference either. Maybe treat empty as null just like during schema inference, or at least provide an additional option, treatEmptyAsNull, so it's more explicit for application users? PS: I marked this as a bug, but it could be better suited as an improvement. was: Spark's JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing tons of integers and empty strings as a Long. That is okay behavior, since I don't set `primitivesAsString` because I do want proper primitive type inference. I store `DataFrameA` into `TableA`. Now, this inference behavior is not respected by the `fromJson` API when I am trying to write new data to `TableA` generated using my schema inference approach. 
That means, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` is a pseudo method that returns the `struct` of TableA's schema somehow. > Spark json datasource reader and fromJson api having inconsistent behavior > -- > > Key: SPARK-44991 > URL: https://issues.apache.org/jira/browse/SPARK-44991 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2 >Reporter: nirav patel >Priority: Major > > Spark's JSON reader can infer the datatype of a field. I am ingesting millions of > datapoints and generating a `DataFrameA`. What I notice is that schema > inference marks the datatype of a field containing tons of integers and empty strings > as a Long. That is okay behavior, since I don't set `primitivesAsString` because I > do want primitive type inference. I store `DataFrameA` into `TableA`. > Now, this inference behavior is not respected by the `fromJson`/`from_json` > API when I am trying to write new data to `TableA`. That means, if I read a chunk > of input data using > `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the > reader complains that an empty string cannot be cast to Long. > `getStruct(TableA)` is a pseudo method that returns the `struct` of TableA's schema > somehow, and `/path/to/more/data` has some values for this field as empty > strings. > I think if the reader doesn't complain about empty strings during schema inference, > it shouldn't complain when reading without inference either. Maybe treat empty > as null just like during schema inference, or at least provide an additional > option, treatEmptyAsNull, so it's more explicit for application users? > PS: I marked this as a bug, but it could be better suited as an improvement. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44991) Spark json datasource reader and fromJson api having inconsistent behavior
nirav patel created SPARK-44991: --- Summary: Spark json datasource reader and fromJson api having inconsistent behavior Key: SPARK-44991 URL: https://issues.apache.org/jira/browse/SPARK-44991 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.2 Reporter: nirav patel Spark's JSON reader can infer the datatype of a field. I am ingesting millions of datapoints and generating a `DataFrameA`. What I notice is that schema inference marks the datatype of a field containing tons of integers and empty strings as a Long. That is okay behavior, since I don't set `primitivesAsString` because I do want proper primitive type inference. I store `DataFrameA` into `TableA`. Now, this inference behavior is not respected by the `fromJson` API when I am trying to write new data to `TableA` generated using my schema inference approach. That means, if I read a chunk of input data using `spark.read.schema(fromJson(getStruct(TableA))).json('/path/to/more/data')`, the reader complains that an empty string cannot be cast to Long. `getStruct(TableA)` is a pseudo method that returns the `struct` of TableA's schema somehow. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
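The asymmetry reported in SPARK-44991 can be illustrated with a plain-Python model of the two reading modes. This is a sketch of the reported behavior, not Spark's actual JSON code path: inference-style reading tolerates empty strings (treating them as nulls while still concluding "Long"), whereas reading with an enforced Long schema fails on the same input.

```python
def infer_long_or_null(raw_values):
    # Inference-style: empty strings are tolerated and become nulls,
    # so a column of integers plus empties still looks Long-typed.
    return [int(v) if v.strip() else None for v in raw_values]

def strict_cast_long(raw_values):
    # Enforced-schema-style: every value must parse as a Long; "" fails.
    return [int(v) for v in raw_values]

data = ["1", "", "42"]
print(infer_long_or_null(data))  # → [1, None, 42]
# strict_cast_long(data) raises ValueError on the empty string,
# mirroring the "empty string cannot be cast to Long" complaint.
```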
[jira] [Assigned] (SPARK-44989) Add a directional message to promote JIRA_ACCESS_TOKEN
[ https://issues.apache.org/jira/browse/SPARK-44989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44989: - Assignee: Dongjoon Hyun > Add a directional message to promote JIRA_ACCESS_TOKEN > -- > > Key: SPARK-44989 > URL: https://issues.apache.org/jira/browse/SPARK-44989 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44989) Add a directional message to promote JIRA_ACCESS_TOKEN
[ https://issues.apache.org/jira/browse/SPARK-44989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44989. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42704 [https://github.com/apache/spark/pull/42704] > Add a directional message to promote JIRA_ACCESS_TOKEN > -- > > Key: SPARK-44989 > URL: https://issues.apache.org/jira/browse/SPARK-44989 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44832) Fix connect client transitive classpath
[ https://issues.apache.org/jira/browse/SPARK-44832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44832. --- Fix Version/s: 3.5.0 Assignee: Herman van Hövell Resolution: Fixed > Fix connect client transitive classpath > > > Key: SPARK-44832 > URL: https://issues.apache.org/jira/browse/SPARK-44832 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44989) Add a directional message to promote JIRA_ACCESS_TOKEN
Dongjoon Hyun created SPARK-44989: - Summary: Add a directional message to promote JIRA_ACCESS_TOKEN Key: SPARK-44989 URL: https://issues.apache.org/jira/browse/SPARK-44989 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44985) Use toString instead of stacktrace for task reaper threadDump
[ https://issues.apache.org/jira/browse/SPARK-44985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44985. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42699 [https://github.com/apache/spark/pull/42699] > Use toString instead of stacktrace for task reaper threadDump > - > > Key: SPARK-44985 > URL: https://issues.apache.org/jira/browse/SPARK-44985 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44985) Use toString instead of stacktrace for task reaper threadDump
[ https://issues.apache.org/jira/browse/SPARK-44985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44985: - Assignee: Kent Yao > Use toString instead of stacktrace for task reaper threadDump > - > > Key: SPARK-44985 > URL: https://issues.apache.org/jira/browse/SPARK-44985 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44972) Eagerly check if the token is valid to align with the behavior of username/password auth
[ https://issues.apache.org/jira/browse/SPARK-44972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44972: - Assignee: Kent Yao > Eagerly check if the token is valid to align with the behavior of > username/password auth > > > Key: SPARK-44972 > URL: https://issues.apache.org/jira/browse/SPARK-44972 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Kent Yao >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44972) Eagerly check if the token is valid to align with the behavior of username/password auth
[ https://issues.apache.org/jira/browse/SPARK-44972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44972. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42625 [https://github.com/apache/spark/pull/42625] > Eagerly check if the token is valid to align with the behavior of > username/password auth > > > Key: SPARK-44972 > URL: https://issues.apache.org/jira/browse/SPARK-44972 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44988) Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type
Flavio Odas created SPARK-44988: --- Summary: Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type Key: SPARK-44988 URL: https://issues.apache.org/jira/browse/SPARK-44988 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.4.0 Reporter: Flavio Odas This bug seems similar to https://issues.apache.org/jira/browse/SPARK-40819, except that it's a problem with INT64 (TIMESTAMP(NANOS,false)), instead of INT64 (TIMESTAMP(NANOS,true)). The error happens whenever I'm trying to read: {code:java} org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false)). at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1762) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:206) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:283) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:224) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:187) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:147) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:117) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.immutable.Range.foreach(Range.scala:158) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:117) at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:87) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:493) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:493) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:473) at scala.collection.immutable.Stream.map(Stream.scala:418) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:473) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464) at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79) at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853) at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759615#comment-17759615 ] Jakub Wozniak commented on SPARK-44805: --- Hello, Is it possible to know any ETA on this one? Is this something that could potentially be fixed in the next version of Spark or rather not? Thanks, Jakub > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java} > +-+ > | f2| > +-+ > |[1, 1, 2]| > +-+ > +-+ > | f2| > +-+ > |[0, 0, 0]| > | [1, 1]| > +-+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? 
> Best regards, > Jakub -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44983) Convert binary to string by to_char for the formats: hex, base64, utf-8
[ https://issues.apache.org/jira/browse/SPARK-44983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-44983. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42632 [https://github.com/apache/spark/pull/42632] > Convert binary to string by to_char for the formats: hex, base64, utf-8 > --- > > Key: SPARK-44983 > URL: https://issues.apache.org/jira/browse/SPARK-44983 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 4.0.0 > > > Map the to_char() function with a binary input to one of hex(), base64(), > decode() to achieve feature parity with: > - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_char > - SAP SQL Anywhere: > https://help.sap.com/docs/SAP_SQL_Anywhere/93079d4ba8e44920ae63ffb4def91f5b/81fe51196ce21014b9c6cf43b298.html > - Oracle: > https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_CHAR-number.html#GUID-00DA076D-2468-41AB-A3AC-CC78DBA0D9CB > - Vertica: > https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_CHAR.htm > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
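Conceptually, the three formats map a binary value to hex, base64, or UTF-8 text. The mapping can be sketched with Python's standard library; the helper name `to_char_binary` is made up for illustration and is not a Spark API, and details such as the letter case of the hex output may differ from what Spark's SQL expressions produce:

```python
import base64

def to_char_binary(data: bytes, fmt: str) -> str:
    # Illustrative only: mimics the three binary-to-string formats
    # that SPARK-44983 maps to_char onto (hex, base64, utf-8).
    if fmt == "hex":
        return data.hex()                              # analogous to hex()
    if fmt == "base64":
        return base64.b64encode(data).decode("ascii")  # analogous to base64()
    if fmt == "utf-8":
        return data.decode("utf-8")                    # analogous to decode(col, 'utf-8')
    raise ValueError(f"unsupported format: {fmt}")

print(to_char_binary(b"Spark", "hex"))     # 537061726b
print(to_char_binary(b"Spark", "base64"))  # U3Bhcms=
print(to_char_binary(b"Spark", "utf-8"))   # Spark
```

The design point of the feature is exactly this delegation: rather than a new conversion routine, to_char with a binary input reuses the existing hex/base64/decode behavior.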
[jira] [Resolved] (SPARK-44974) Replace SparkSession/Dataset/KeyValueGroupedDataset with null during serialization
[ https://issues.apache.org/jira/browse/SPARK-44974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-44974. --- Fix Version/s: 3.5.0 Resolution: Fixed > Replace SparkSession/Dataset/KeyValueGroupedDataset with null during > serialization > -- > > Key: SPARK-44974 > URL: https://issues.apache.org/jira/browse/SPARK-44974 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44976) Preserve full principal user name on executor side
[ https://issues.apache.org/jira/browse/SPARK-44976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YUBI LEE updated SPARK-44976: - Summary: Preserve full principal user name on executor side (was: Utils.getCurrentUserName should return the full principal name) > Preserve full principal user name on executor side > -- > > Key: SPARK-44976 > URL: https://issues.apache.org/jira/browse/SPARK-44976 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.3, 3.3.3, 3.4.1 >Reporter: YUBI LEE >Priority: Major > > SPARK-6558 changed the behavior of {{Utils.getCurrentUserName()}} to use the > short name instead of the full principal name. > Due to this, it doesn't respect the {{hadoop.security.auth_to_local}} rule on > the side of a non-kerberized HDFS namenode. > For example, I use 2 HDFS clusters. One is kerberized, the other one is not. > I have a rule that adds a prefix to the username on the non-kerberized cluster > if someone accesses it from the kerberized cluster. > {code} > > hadoop.security.auth_to_local > > RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/ > RULE:[2:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/ > DEFAULT > > {code} > However, if I submit a Spark job with the keytab & principal options, the HDFS > directory and file ownership is not consistent. > (I change some words for privacy.) 
> {code} > $ hdfs dfs -ls hdfs:///user/eub/some/path/20230510/23 > Found 52 items > -rw-rw-rw- 3 _ex_eub hdfs 0 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/_SUCCESS > -rw-r--r-- 3 eub hdfs 134418857 2023-05-11 00:15 > hdfs:///user/eub/some/path/20230510/23/part-0-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > -rw-r--r-- 3 eub hdfs 153410049 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/part-1-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > -rw-r--r-- 3 eub hdfs 157260989 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/part-2-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > -rw-r--r-- 3 eub hdfs 156222760 2023-05-11 00:16 > hdfs:///user/eub/some/path/20230510/23/part-3-b781be38-9dbc-41da-8d0e-597a7f343649-c000.txt.gz > {code} > Another interesting point is that if I submit a Spark job without the keytab and > principal options but with Kerberos authentication via {{kinit}}, it will not > follow the {{hadoop.security.auth_to_local}} rule completely. > {code} > $ hdfs dfs -ls hdfs:///user/eub/output/ > Found 3 items > -rw-rw-r--+ 3 eub hdfs 0 2023-08-25 12:31 > hdfs:///user/eub/output/_SUCCESS > -rw-rw-r--+ 3 eub hdfs512 2023-08-25 12:31 > hdfs:///user/eub/output/part-0.gz > -rw-rw-r--+ 3 eub hdfs574 2023-08-25 12:31 > hdfs:///user/eub/output/part-1.gz > {code} > I finally found that if I submit a Spark job with the {{--principal}} and > {{--keytab}} options, the UGI will be different. > (refer to > https://github.com/apache/spark/blob/2583bd2c16a335747895c0843f438d0966f47ecd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L905). > Only the file ({{_SUCCESS}}) and the output directory created by the driver (application > master side) will respect {{hadoop.security.auth_to_local}} on the > non-kerberized namenode, and only if the {{--principal}} and {{--keytab}} options are > provided. 
> No matter whether HDFS files or directories are created by the executor or the driver, > they should respect the {{hadoop.security.auth_to_local}} rule and should be the > same. > A workaround is to pass an additional argument to change {{SPARK_USER}} on the > executor side, > e.g. {{--conf spark.executorEnv.SPARK_USER=_ex_eub}}. > {{--conf spark.yarn.appMasterEnv.SPARK_USER=_ex_eub}} will cause an error, > because there is logic that appends environment values with {{:}} (colon) as a > separator. > - > https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L893 > - > https://github.com/apache/spark/blob/4748d858b4478ea7503b792050d4735eae83b3cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L52 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
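The rewrite that a rule like {{RULE:[1:$1@$0](.*@EXAMPLE.COM)s/(.+)@.*/_ex_$1/}} performs can be sketched as a plain regex substitution. This is a simplified illustration only: the helper name is made up, and Hadoop's actual KerberosName rule engine handles component splitting and multiple rules that this sketch omits:

```python
import re

def apply_auth_to_local(principal: str) -> str:
    # Sketch of one auth_to_local rule: if the principal's realm is
    # EXAMPLE.COM, strip the realm and prepend "_ex_"; otherwise fall
    # through to the DEFAULT rule (strip the realm only).
    if re.fullmatch(r".*@EXAMPLE\.COM", principal):
        return re.sub(r"(.+)@.*", r"_ex_\1", principal)
    return principal.split("@")[0]  # DEFAULT

print(apply_auth_to_local("eub@EXAMPLE.COM"))  # _ex_eub
print(apply_auth_to_local("eub@OTHER.ORG"))    # eub
```

This is exactly the mapping the issue expects to see on the non-kerberized namenode: the owner should consistently come out as {{_ex_eub}} regardless of whether the driver or an executor created the file.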
[jira] [Assigned] (SPARK-44984) Remove _get_alias from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-44984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44984: - Assignee: Ruifeng Zheng > Remove _get_alias from DataFrame > > > Key: SPARK-44984 > URL: https://issues.apache.org/jira/browse/SPARK-44984 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44984) Remove _get_alias from DataFrame
[ https://issues.apache.org/jira/browse/SPARK-44984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44984. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42698 [https://github.com/apache/spark/pull/42698] > Remove _get_alias from DataFrame > > > Key: SPARK-44984 > URL: https://issues.apache.org/jira/browse/SPARK-44984 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42905) pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect and inconsistent results for the same DataFrame if it has huge amount of Ties.
[ https://issues.apache.org/jira/browse/SPARK-42905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757858#comment-17757858 ] zhangzhenhao edited comment on SPARK-42905 at 8/28/23 11:04 AM: minimal reproducible example. the result is incorrect and inconsistent when tied value size > 10_000_000 {code:java} import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector} import org.apache.spark.ml.stat.Correlation import org.apache.spark.sql.Row val N = 1002 val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0) val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0) //val s1 = Statistics.corr(x, y, "spearman") val df = x.zip(y) .map{case (x, y) => Vectors.dense(x, y)} .map(Tuple1.apply) .repartition(1) .toDF("features") val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head val r = coeff1(0, 1) println(s"spearman correlation in spark: $r") // spearman correlation in spark: -9.90476024495E-8 {code} the correct result is -1.0 was (Author: JIRAUSER301717): minimal reproducible example. the result is incorrect and inconsistent when tied value size > 10_000_000 {code:java} import org.apache.spark.ml.linalg.{Matrix, Vectors, Vector} import org.apache.spark.ml.stat.Correlation import org.apache.spark.sql.Row val N = 1002 val x = sc.range(0, N).map(i => if (i < N - 1) 1.0 else 2.0) val y = sc.range(0, N).map(i => if (i < N - 1) 2.0 else 1.0) //val s1 = Statistics.corr(x, y, "spearman") val df = x.zip(y) .map{case (x, y) => Vectors.dense(x, y)} .map(Tuple1.apply) .repartition(1) .toDF("features") val Row(coeff1: Matrix) = Correlation.corr(df, "features", "spearman").head val r = coeff1(0, 1) println(s"pearson correlation in spark: $r") // pearson correlation in spark: -9.90476024495E-8 {code} the correct result is -1.0 > pyspark.ml.stat.Correlation - Spearman Correlation method giving incorrect > and inconsistent results for the same DataFrame if it has huge amount of Ties. 
> - > > Key: SPARK-42905 > URL: https://issues.apache.org/jira/browse/SPARK-42905 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.3.0 >Reporter: dronzer >Priority: Critical > Labels: correctness > Attachments: image-2023-03-23-10-51-28-420.png, > image-2023-03-23-10-52-11-481.png, image-2023-03-23-10-52-49-392.png, > image-2023-03-23-10-53-37-461.png, image-2023-03-23-10-55-26-879.png > > > pyspark.ml.stat.Correlation > Following is the Scenario where the Correlation function fails for giving > correct Spearman Coefficient Results. > Tested E.g -> Spark DataFrame has 2 columns A and B. > !image-2023-03-23-10-55-26-879.png|width=562,height=162! > Column A has 3 Distinct Values and total of 108Million rows > Column B has 4 Distinct Values and total of 108Million rows > If I Calculate the correlation for this DataFrame in Python Pandas DF.corr, > it gives the correct answer even if i run the same code multiple times the > same answer is produced. (Each column has only 3-4 distinct values) > !image-2023-03-23-10-53-37-461.png|width=468,height=287! > > Coming to Spark and using Spearman Correlation produces a *different results* > for the *same dataframe* on multiple runs. (see below) (each column in this > df has only 3-4 distinct values) > !image-2023-03-23-10-52-49-392.png|width=516,height=322! > > Basically in python Pandas Df.corr it gives same results on same dataframe on > multiple runs which is expected behaviour. However, in Spark using the same > data it gives different result, moreover running the same cell with same data > multiple times produces different results meaning the output is inconsistent. > Coming to data the only observation I could conclude is Ties in data. (Only > 3-4 Distinct values over 108M Rows.) This scenario is not handled in Spark > Correlation method as the same data when used in python using df.corr > produces consistent results. 
> The only workaround we could find to get consistent output in Spark, matching > Python, is by using a Pandas UDF as shown below: > !image-2023-03-23-10-52-11-481.png|width=518,height=111! > !image-2023-03-23-10-51-28-420.png|width=509,height=270! > > We also tried the pyspark.pandas.DataFrame.corr method, and it produces incorrect > and inconsistent results for this case too. > Only a Pandas UDF seems to provide consistent results. > > Another point to note: if I add some random noise to the data, which in > turn increases the number of distinct values, it again gives consistent > results across runs. This makes me believe that the Python version handles > ties correctly.
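The -1.0 value that the earlier SPARK-42905 comment identifies as correct can be checked with a small pure-Python Spearman implementation (average ranks for ties, then Pearson on the ranks). This is a reference sketch under the standard tie-handling convention, not Spark MLlib's code; the helper names are made up:

```python
import math

def ranks_with_ties(xs):
    # Assign 1-based ranks, averaging ranks over tied values
    # (the standard Spearman convention, as used by pandas/scipy).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def spearman(x, y):
    # Spearman = Pearson correlation of the (tie-averaged) ranks.
    return pearson(ranks_with_ties(x), ranks_with_ties(y))

# Same data as the minimal example above: 1001 tied values plus one outlier.
N = 1002
x = [1.0] * (N - 1) + [2.0]
y = [2.0] * (N - 1) + [1.0]
print(spearman(x, y))  # -1.0 up to float rounding
```

With average ranks, the 1001 tied values collapse to a single rank, and the remaining pair is perfectly anti-ordered, so the coefficient is exactly -1; this matches the pandas results the reporter observed.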
[jira] [Updated] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100
[ https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-44987: - Description: Assign a name and improve the error message format. > Assign name to the error class _LEGACY_ERROR_TEMP_1100 > -- > > Key: SPARK-44987 > URL: https://issues.apache.org/jira/browse/SPARK-44987 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Minor > > Assign a name and improve the error message format. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100
[ https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-44987: - Reporter: Max Gekk (was: BingKun Pan) > Assign name to the error class _LEGACY_ERROR_TEMP_1100 > -- > > Key: SPARK-44987 > URL: https://issues.apache.org/jira/browse/SPARK-44987 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100
[ https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-44987: Assignee: Max Gekk > Assign name to the error class _LEGACY_ERROR_TEMP_1100 > -- > > Key: SPARK-44987 > URL: https://issues.apache.org/jira/browse/SPARK-44987 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100[1017,1073,1074,1076,1125,1126]
Max Gekk created SPARK-44987: Summary: Assign name to the error class _LEGACY_ERROR_TEMP_1100[1017,1073,1074,1076,1125,1126] Key: SPARK-44987 URL: https://issues.apache.org/jira/browse/SPARK-44987 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44987) Assign name to the error class _LEGACY_ERROR_TEMP_1100
[ https://issues.apache.org/jira/browse/SPARK-44987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-44987: - Summary: Assign name to the error class _LEGACY_ERROR_TEMP_1100 (was: Assign name to the error class _LEGACY_ERROR_TEMP_1100[1017,1073,1074,1076,1125,1126]) > Assign name to the error class _LEGACY_ERROR_TEMP_1100 > -- > > Key: SPARK-44987 > URL: https://issues.apache.org/jira/browse/SPARK-44987 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44983) Convert binary to string by to_char for the formats: hex, base64, utf-8
[ https://issues.apache.org/jira/browse/SPARK-44983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759530#comment-17759530 ] Hudson commented on SPARK-44983: User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/42632 > Convert binary to string by to_char for the formats: hex, base64, utf-8 > --- > > Key: SPARK-44983 > URL: https://issues.apache.org/jira/browse/SPARK-44983 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Map the to_char() function with a binary input to one of hex(), base64(), > decode() to achieve feature parity with: > - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_char > - SAP SQL Anywhere: > https://help.sap.com/docs/SAP_SQL_Anywhere/93079d4ba8e44920ae63ffb4def91f5b/81fe51196ce21014b9c6cf43b298.html > - Oracle: > https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_CHAR-number.html#GUID-00DA076D-2468-41AB-A3AC-CC78DBA0D9CB > - Vertica: > https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_CHAR.htm > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44986) There should be a gap at the bottom of the HTML
[ https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759520#comment-17759520 ] ASF GitHub Bot commented on SPARK-44986: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/42702 > There should be a gap at the bottom of the HTML > --- > > Key: SPARK-44986 > URL: https://issues.apache.org/jira/browse/SPARK-44986 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Attachments: image-2023-08-28-16-46-04-705.png, > image-2023-08-28-16-47-11-582.png > > > Before: > !image-2023-08-28-16-47-11-582.png|width=794,height=392! > > After: > !image-2023-08-28-16-46-04-705.png|width=744,height=329! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML
[ https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44986: Attachment: image-2023-08-28-16-47-11-582.png > There should be a gap at the bottom of the HTML > --- > > Key: SPARK-44986 > URL: https://issues.apache.org/jira/browse/SPARK-44986 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Attachments: image-2023-08-28-16-46-04-705.png, > image-2023-08-28-16-47-11-582.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML
[ https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44986: Description: Before: !image-2023-08-28-16-47-11-582.png|width=794,height=392! After: !image-2023-08-28-16-46-04-705.png|width=744,height=329! > There should be a gap at the bottom of the HTML > --- > > Key: SPARK-44986 > URL: https://issues.apache.org/jira/browse/SPARK-44986 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Attachments: image-2023-08-28-16-46-04-705.png, > image-2023-08-28-16-47-11-582.png > > > Before: > !image-2023-08-28-16-47-11-582.png|width=794,height=392! > > After: > !image-2023-08-28-16-46-04-705.png|width=744,height=329!
[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML
[ https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-44986: Attachment: image-2023-08-28-16-46-04-705.png > There should be a gap at the bottom of the HTML > --- > > Key: SPARK-44986 > URL: https://issues.apache.org/jira/browse/SPARK-44986 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Attachments: image-2023-08-28-16-46-04-705.png > >
[jira] [Created] (SPARK-44986) There should be a gap at the bottom of the HTML
BingKun Pan created SPARK-44986: --- Summary: There should be a gap at the bottom of the HTML Key: SPARK-44986 URL: https://issues.apache.org/jira/browse/SPARK-44986 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: BingKun Pan
[jira] [Resolved] (SPARK-44982) Mark Spark Connect configurations as static configuration
[ https://issues.apache.org/jira/browse/SPARK-44982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44982. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42695 [https://github.com/apache/spark/pull/42695] > Mark Spark Connect configurations as static configuration > - > > Key: SPARK-44982 > URL: https://issues.apache.org/jira/browse/SPARK-44982 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Spark Connect server configurations are not yet marked as either static or > runtime. We should mark them static.
[jira] [Assigned] (SPARK-44982) Mark Spark Connect configurations as static configuration
[ https://issues.apache.org/jira/browse/SPARK-44982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44982: Assignee: Hyukjin Kwon > Mark Spark Connect configurations as static configuration > - > > Key: SPARK-44982 > URL: https://issues.apache.org/jira/browse/SPARK-44982 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Spark Connect server configurations are not yet marked as either static or > runtime. We should mark them static.
[jira] [Resolved] (SPARK-44981) Filter out static configurations used in local mode
[ https://issues.apache.org/jira/browse/SPARK-44981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-44981. -- Fix Version/s: 3.5.0 4.0.0 Resolution: Fixed Issue resolved by pull request 42694 [https://github.com/apache/spark/pull/42694] > Filter out static configurations used in local mode > --- > > Key: SPARK-44981 > URL: https://issues.apache.org/jira/browse/SPARK-44981 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.5.0, 4.0.0 > > > If you set a static configuration with `--remote local` mode, it shows a > bunch of warnings as below: > {code} > 23/08/28 11:39:42 ERROR ErrorUtils: Spark Connect RPC error during: config. > UserId: hyukjin.kwon. SessionId: 424674ef-af95-4b12-b10e-86479413f9fd. > org.apache.spark.sql.AnalysisException: Cannot modify the value of a static > config: spark.connect.copyFromLocalToFs.allowDestLocal. 
> at > org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfStaticConfigError(QueryCompilationErrors.scala:3227) > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:162) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40) > at > org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120) > at > org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751) > at > org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) > at > org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) > at > org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) > at > org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) > at > org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at 
java.lang.Thread.run(Thread.java:748) > {code}
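The fix described here can be sketched, in heavily simplified form, as a client-side filter that withholds static configurations instead of forwarding them to RuntimeConfig.set(). The function name and the prefix-based detection below are hypothetical illustrations only; Spark tracks its static configurations explicitly rather than by key prefix.

```python
# Illustrative subset of static config keys, NOT Spark's real registry.
STATIC_CONF_PREFIXES = (
    "spark.connect.",
    "spark.sql.warehouse.dir",
)


def split_configs(confs: dict) -> tuple[dict, dict]:
    """Partition options into (runtime, static). Only runtime options may be
    applied via RuntimeConfig.set() after the session exists; static ones must
    be set at SparkSession creation time, hence the filtering in local mode."""
    runtime, static = {}, {}
    for key, value in confs.items():
        bucket = static if key.startswith(STATIC_CONF_PREFIXES) else runtime
        bucket[key] = value
    return runtime, static


runtime, static = split_configs({
    "spark.sql.shuffle.partitions": "8",
    "spark.connect.copyFromLocalToFs.allowDestLocal": "true",
})
print(sorted(runtime))  # only the runtime option is forwarded
print(sorted(static))   # the static option is filtered out, avoiding the error
```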
[jira] [Assigned] (SPARK-44981) Filter out static configurations used in local mode
[ https://issues.apache.org/jira/browse/SPARK-44981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-44981: Assignee: Hyukjin Kwon > Filter out static configurations used in local mode > --- > > Key: SPARK-44981 > URL: https://issues.apache.org/jira/browse/SPARK-44981 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > If you set a static configuration with `--remote local` mode, it shows a > bunch of warnings as below: > {code} > 23/08/28 11:39:42 ERROR ErrorUtils: Spark Connect RPC error during: config. > UserId: hyukjin.kwon. SessionId: 424674ef-af95-4b12-b10e-86479413f9fd. > org.apache.spark.sql.AnalysisException: Cannot modify the value of a static > config: spark.connect.copyFromLocalToFs.allowDestLocal. > at > org.apache.spark.sql.errors.QueryCompilationErrors$.cannotModifyValueOfStaticConfigError(QueryCompilationErrors.scala:3227) > at > org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:162) > at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1(SparkConnectConfigHandler.scala:67) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.$anonfun$handleSet$1$adapted(SparkConnectConfigHandler.scala:65) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handleSet(SparkConnectConfigHandler.scala:65) > at > org.apache.spark.sql.connect.service.SparkConnectConfigHandler.handle(SparkConnectConfigHandler.scala:40) > at > org.apache.spark.sql.connect.service.SparkConnectService.config(SparkConnectService.scala:120) > at > 
org.apache.spark.connect.proto.SparkConnectServiceGrpc$MethodHandlers.invoke(SparkConnectServiceGrpc.java:751) > at > org.sparkproject.connect.grpc.io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182) > at > org.sparkproject.connect.grpc.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:346) > at > org.sparkproject.connect.grpc.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:860) > at > org.sparkproject.connect.grpc.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) > at > org.sparkproject.connect.grpc.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44985) Use toString instead of stacktrace for task reaper threadDump
Kent Yao created SPARK-44985: Summary: Use toString instead of stacktrace for task reaper threadDump Key: SPARK-44985 URL: https://issues.apache.org/jira/browse/SPARK-44985 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Kent Yao
[jira] [Created] (SPARK-44984) Remove _get_alias from DataFrame
Ruifeng Zheng created SPARK-44984: - Summary: Remove _get_alias from DataFrame Key: SPARK-44984 URL: https://issues.apache.org/jira/browse/SPARK-44984 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 4.0.0 Reporter: Ruifeng Zheng
[jira] [Comment Edited] (SPARK-44819) Make Python the first language in all Spark code snippet
[ https://issues.apache.org/jira/browse/SPARK-44819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759468#comment-17759468 ] BingKun Pan edited comment on SPARK-44819 at 8/28/23 7:36 AM: -- This issue is a duplicate of `https://issues.apache.org/jira/browse/SPARK-42642`, I think we can close this. was (Author: panbingkun): I work on it. > Make Python the first language in all Spark code snippet > > > Key: SPARK-44819 > URL: https://issues.apache.org/jira/browse/SPARK-44819 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.5.0 >Reporter: Allison Wang >Priority: Major > Attachments: Screenshot 2023-08-15 at 11.59.11.png > > > Currently, the first and default language for all code snippets is Scala. For > instance: https://spark.apache.org/docs/latest/quick-start.html > We should make Python the first language for all the code snippets. > >
[jira] [Created] (SPARK-44983) Convert binary to string by to_char for the formats: hex, base64, utf-8
Max Gekk created SPARK-44983: Summary: Convert binary to string by to_char for the formats: hex, base64, utf-8 Key: SPARK-44983 URL: https://issues.apache.org/jira/browse/SPARK-44983 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Map the to_char() function with a binary input to one of hex(), base64(), decode() to achieve feature parity with: - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_char - SAP SQL Anywhere: https://help.sap.com/docs/SAP_SQL_Anywhere/93079d4ba8e44920ae63ffb4def91f5b/81fe51196ce21014b9c6cf43b298.html - Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/TO_CHAR-number.html#GUID-00DA076D-2468-41AB-A3AC-CC78DBA0D9CB - Vertica: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Formatting/TO_CHAR.htm