[jira] [Updated] (SPARK-38584) Unify the data validation
[ https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-38584: - Description: 1, input vector validation is missing in most algorithms; when the input dataset contains some invalid values (NaN/Infinity), then: * the training may run successfully and return a model with invalid coefficients, like LinearSVC * the training will fail with an irrelevant message, like KMeans {code:java} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ import org.apache.spark.ml.clustering._ val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF() val svc = new LinearSVC() val model = svc.fit(df) scala> model.intercept res0: Double = NaN scala> model.coefficients res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] val km = new KMeans().setK(2) scala> km.fit(df) 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113) java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity at scala.Predef$.require(Predef.scala:281) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) {code} 2, related methods to validate the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, org.apache.spark.ml.util.MetadataUtils, etc. I think it is time to unify related methods into one source file. was: 1, input vector validation is missing in most algorithms; when the input dataset contains some invalid values (NaN/Infinity), then: * the training may run successfully with an invalid model, like LinearSVC * the training will fail with an irrelevant message, like KMeans {code:java} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ import org.apache.spark.ml.clustering._ val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF() val svc = new LinearSVC() val model = svc.fit(df) scala> model.intercept res0: Double = NaN scala> model.coefficients res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] val km = new KMeans().setK(2) scala> km.fit(df) 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113) java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity at scala.Predef$.require(Predef.scala:281) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) {code} 2, related methods to validate the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, org.apache.spark.ml.util.MetadataUtils, etc. I think it is time to unify related methods into one source file.
> Unify the data validation > - > > Key: SPARK-38584 > URL: https://issues.apache.org/jira/browse/SPARK-38584 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major > > 1, input vector validation is missing in most algorithms; when the input > dataset contains some invalid values (NaN/Infinity), then: > * the training may run successfully and return a model with invalid > coefficients, like LinearSVC > * the training will fail with an irrelevant message, like KMeans > > {code:java} > import org.apache.spark.ml.feature._ > import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.classification._ > import org.apache.spark.ml.clustering._ > val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, > Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, > 2.0)))).toDF() > val svc = new LinearSVC() > val model = svc.fit(df) > scala> model.intercept > res0: Double = NaN > scala> model.coefficients > res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] > val km = new KMeans().setK(2) > scala> km.fit(df) > 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID > 113) > java.lang.IllegalArgumentException: requirement failed: Both norms should be > greater or equal to 0.0, found norm1=NaN, norm2=Infinity > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) > {code} > > 2, related methods to validate the input dataset (like labels/weights) exist in > {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, > org.apache.spark.ml.util.MetadataUtils, etc. > > I think it is time to unify related methods into one source file.
[jira] [Created] (SPARK-38584) Unify the data validation
zhengruifeng created SPARK-38584: Summary: Unify the data validation Key: SPARK-38584 URL: https://issues.apache.org/jira/browse/SPARK-38584 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.4.0 Reporter: zhengruifeng 1, input vector validation is missing in most algorithms; when the input dataset contains some invalid values (NaN/Infinity), then: * the training may run successfully with an invalid model, like LinearSVC * the training will fail with an irrelevant message, like KMeans {code:java} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ import org.apache.spark.ml.clustering._ val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF() val svc = new LinearSVC() val model = svc.fit(df) scala> model.intercept res0: Double = NaN scala> model.coefficients res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] val km = new KMeans().setK(2) scala> km.fit(df) 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113) java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity at scala.Predef$.require(Predef.scala:281) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) {code} 2, related methods to validate the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, org.apache.spark.ml.util.MetadataUtils, etc. I think it is time to unify related methods into one source file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
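A minimal sketch of what a unified check could look like, built on the public {{vector_to_array}} helper from {{org.apache.spark.ml.functions}}; the function name and error message here are illustrative, not an actual Spark API:

{code:java}
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, exists, isnan}

// Hypothetical unified check: fail fast at fit() time if the features column
// contains NaN or +/-Infinity, instead of silently producing a NaN model
// (LinearSVC) or failing with an unrelated error deep inside the algorithm
// (KMeans).
def validateVectors(dataset: DataFrame, featuresCol: String): DataFrame = {
  val invalidCount = dataset
    .filter(exists(vector_to_array(col(featuresCol)),
      v => isnan(v) || v === Double.PositiveInfinity || v === Double.NegativeInfinity))
    .limit(1)
    .count()
  require(invalidCount == 0,
    s"Vector column '$featuresCol' must not contain NaN/Infinity values.")
  dataset
}
{code}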
[jira] [Assigned] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38582: Assignee: (was: Apache Spark) > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507996#comment-17507996 ] Apache Spark commented on SPARK-38582: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/35886 > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507995#comment-17507995 ] Apache Spark commented on SPARK-38582: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/35886 > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38582: Assignee: Apache Spark > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Assignee: Apache Spark >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38566) Revert the parser changes for DEFAULT column support
[ https://issues.apache.org/jira/browse/SPARK-38566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507994#comment-17507994 ] Apache Spark commented on SPARK-38566: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35885 > Revert the parser changes for DEFAULT column support > > > Key: SPARK-38566 > URL: https://issues.apache.org/jira/browse/SPARK-38566 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Blocker > > Revert the commit > https://github.com/apache/spark/commit/e21cb62d02c85a66771822cdd49c49dbb3e44502 > from branch-3.3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38566) Revert the parser changes for DEFAULT column support
[ https://issues.apache.org/jira/browse/SPARK-38566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507993#comment-17507993 ] Apache Spark commented on SPARK-38566: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35885 > Revert the parser changes for DEFAULT column support > > > Key: SPARK-38566 > URL: https://issues.apache.org/jira/browse/SPARK-38566 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Blocker > > Revert the commit > https://github.com/apache/spark/commit/e21cb62d02c85a66771822cdd49c49dbb3e44502 > from branch-3.3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38583) to_timestamp should allow numeric types
Hyukjin Kwon created SPARK-38583: Summary: to_timestamp should allow numeric types Key: SPARK-38583 URL: https://issues.apache.org/jira/browse/SPARK-38583 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should allow them back: {code} spark.range(1).selectExpr("to_timestamp(id)").show() {code} **Before** {code} +-------------------+ |   to_timestamp(id)| +-------------------+ |1970-01-01 09:00:00| +-------------------+ {code} **After** {code} +----------------+ |to_timestamp(id)| +----------------+ |            null| +----------------+ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38583) to_timestamp should allow numeric types
[ https://issues.apache.org/jira/browse/SPARK-38583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38583: - Description: SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should allow them back: {code} spark.range(1).selectExpr("to_timestamp(id)").show() {code} *Before* {code} +-------------------+ |   to_timestamp(id)| +-------------------+ |1970-01-01 09:00:00| +-------------------+ {code} *After* {code} +----------------+ |to_timestamp(id)| +----------------+ |            null| +----------------+ {code} was: SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should allow them back: {code} spark.range(1).selectExpr("to_timestamp(id)").show() {code} **Before** {code} +-------------------+ |   to_timestamp(id)| +-------------------+ |1970-01-01 09:00:00| +-------------------+ {code} **After** {code} +----------------+ |to_timestamp(id)| +----------------+ |            null| +----------------+ {code} > to_timestamp should allow numeric types > --- > > Key: SPARK-38583 > URL: https://issues.apache.org/jira/browse/SPARK-38583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should > allow them back: > {code} > spark.range(1).selectExpr("to_timestamp(id)").show() > {code} > *Before* > {code} > +-------------------+ > |   to_timestamp(id)| > +-------------------+ > |1970-01-01 09:00:00| > +-------------------+ > {code} > *After* > {code} > +----------------+ > |to_timestamp(id)| > +----------------+ > |            null| > +----------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
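Until the behavior is restored, an explicit cast reproduces the pre-SPARK-38240 semantics; a minimal sketch, assuming the usual interpretation of a numeric value as seconds since the epoch:

{code:java}
// Regression: to_timestamp(id) currently returns null for numeric input.
spark.range(1).selectExpr("to_timestamp(id)").show()

// Workaround sketch: CAST interprets a numeric value as seconds since the
// epoch, matching what to_timestamp(id) returned before SPARK-38240.
spark.range(1).selectExpr("CAST(id AS TIMESTAMP)").show()
{code}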
[jira] [Created] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
qian created SPARK-38582: Summary: Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern Key: SPARK-38582 URL: https://issues.apache.org/jira/browse/SPARK-38582 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.1 Reporter: qian There are many duplicate code patterns in Spark Code: {code:java} new EnvVarBuilder() .withName(key) .withValue(value) .build() {code} {code:java} new EnvVarBuilder() .withName(name) .withValueFrom(new EnvVarSourceBuilder() .withNewFieldRef(version, field) .build()) .build() {code} [The assignment statement for executor envVar | https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] has 63 lines. We could introduce _buildEnvVarsWithKV_ and _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
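A sketch of what the proposed helpers could look like; the signatures are assumptions derived from the duplicated patterns above, not the final {{KubernetesUtils}} API:

{code:java}
import io.fabric8.kubernetes.api.model.{EnvVar, EnvVarBuilder, EnvVarSourceBuilder}

// Build plain key/value environment variables.
def buildEnvVarsWithKV(kvs: Seq[(String, String)]): Seq[EnvVar] =
  kvs.map { case (key, value) =>
    new EnvVarBuilder()
      .withName(key)
      .withValue(value)
      .build()
  }

// Build environment variables whose values come from pod field references,
// e.g. status.podIP or metadata.name.
def buildEnvVarsWithFieldRef(refs: Seq[(String, String, String)]): Seq[EnvVar] =
  refs.map { case (name, apiVersion, fieldPath) =>
    new EnvVarBuilder()
      .withName(name)
      .withValueFrom(new EnvVarSourceBuilder()
        .withNewFieldRef(apiVersion, fieldPath)
        .build())
      .build()
  }
{code}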
[jira] [Resolved] (SPARK-38557) What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this?
[ https://issues.apache.org/jira/browse/SPARK-38557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38557. -- Resolution: Invalid > What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData > and how to fix or work around this? > > > Key: SPARK-38557 > URL: https://issues.apache.org/jira/browse/SPARK-38557 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 > AWS EMR 6.3.0 > python 3.7.2 >Reporter: Dmitry Goldenberg >Priority: Major > > I'm seeing errors such as the one below when executing a structured Spark > Streaming app which streams data from AWS Kinesis. > > I've googled the error but can't tell what may be the cause. Is Spark running > out of disk space? Something else? > {code:java} > // From the stderr log in EMR > 22/03/15 00:54:00 WARN HDFSMetadataCommitter: Error while fetching MetaData > [attempt = 1] > java.lang.IllegalStateException: > hdfs://ip-10-2-XXX-XXX.awsinternal.acme.com:8020/mnt/tmp/temporary-03b8fecf-32d5-422c-9375-4c3450ed0bb8/sources/0/shard-commit/0 > does not exist > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.$anonfun$get$1(HDFSMetadataCommitter.scala:163) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.withRetry(HDFSMetadataCommitter.scala:229) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.get(HDFSMetadataCommitter.scala:151) > at > org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:275) > at > org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:163) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:399) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:399) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:211) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at >
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244){code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38557) What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this?
[ https://issues.apache.org/jira/browse/SPARK-38557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507976#comment-17507976 ] Hyukjin Kwon commented on SPARK-38557: -- For questions, it is better to interact with the Spark mailing list. Let's file an issue after having a discussion there first. > What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData > and how to fix or work around this? > > > Key: SPARK-38557 > URL: https://issues.apache.org/jira/browse/SPARK-38557 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 > AWS EMR 6.3.0 > python 3.7.2 >Reporter: Dmitry Goldenberg >Priority: Major > > I'm seeing errors such as the one below when executing a structured Spark > Streaming app which streams data from AWS Kinesis. > > I've googled the error but can't tell what may be the cause. Is Spark running > out of disk space? Something else? > {code:java} > // From the stderr log in EMR > 22/03/15 00:54:00 WARN HDFSMetadataCommitter: Error while fetching MetaData > [attempt = 1] > java.lang.IllegalStateException: > hdfs://ip-10-2-XXX-XXX.awsinternal.acme.com:8020/mnt/tmp/temporary-03b8fecf-32d5-422c-9375-4c3450ed0bb8/sources/0/shard-commit/0 > does not exist > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.$anonfun$get$1(HDFSMetadataCommitter.scala:163) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.withRetry(HDFSMetadataCommitter.scala:229) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.get(HDFSMetadataCommitter.scala:151) > at > org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:275) > at > org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:163) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:399) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:399) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:211) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at >
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244){code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38578) Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE
[ https://issues.apache.org/jira/browse/SPARK-38578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38578: -- Summary: Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE (was: Avoid unnecessary sort in FileFormatWriter if user has specified sort) > Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE > > > Key: SPARK-38578 > URL: https://issues.apache.org/jira/browse/SPARK-38578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > FileFormatWriter will check and add an implicit sort for dynamic partition > columns or bucket columns according to the input physical plan. The check > now always fails under AQE since AdaptiveSparkPlanExec has no outputOrdering. > That causes a redundant sort if the user has specified a sort which satisfies > the required ordering (dynamic partition and bucket columns). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38581) List of supported pandas APIs for pandas API on Spark docs.
[ https://issues.apache.org/jira/browse/SPARK-38581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38581: Issue Type: Documentation (was: Bug) > List of supported pandas APIs for pandas API on Spark docs. > --- > > Key: SPARK-38581 > URL: https://issues.apache.org/jira/browse/SPARK-38581 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > In Modin, they have a [list of supported pandas > APIs|https://modin.readthedocs.io/en/stable/supported_apis/dataframe_supported.html]. > It would be great if we also had a list of supported pandas APIs so that users > can easily find which APIs are available or not in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38581) List of supported pandas APIs for pandas API on Spark docs.
[ https://issues.apache.org/jira/browse/SPARK-38581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507962#comment-17507962 ] Haejoon Lee commented on SPARK-38581: - I'm working on it > List of supported pandas APIs for pandas API on Spark docs. > --- > > Key: SPARK-38581 > URL: https://issues.apache.org/jira/browse/SPARK-38581 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > In Modin, they have a [list of supported pandas > APIs|https://modin.readthedocs.io/en/stable/supported_apis/dataframe_supported.html]. > It would be great if we also had a list of supported pandas APIs so that users > can easily find which APIs are available or not in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38581) List of supported pandas APIs for pandas API on Spark docs.
Haejoon Lee created SPARK-38581: --- Summary: List of supported pandas APIs for pandas API on Spark docs. Key: SPARK-38581 URL: https://issues.apache.org/jira/browse/SPARK-38581 Project: Spark Issue Type: Bug Components: Documentation, PySpark Affects Versions: 3.3.0 Reporter: Haejoon Lee In Modin, they have a [list of supported pandas APIs|https://modin.readthedocs.io/en/stable/supported_apis/dataframe_supported.html]. It would be great if we also had a list of supported pandas APIs so that users can easily find which APIs are available or not in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38579) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507958#comment-17507958 ] Apache Spark commented on SPARK-38579: -- User 'yym1995' has created a pull request for this issue: https://github.com/apache/spark/pull/35884 > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38579 > URL: https://issues.apache.org/jira/browse/SPARK-38579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38579) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38579: Assignee: Apache Spark > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38579 > URL: https://issues.apache.org/jira/browse/SPARK-38579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Assignee: Apache Spark >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38579) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38579: Assignee: (was: Apache Spark) > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38579 > URL: https://issues.apache.org/jira/browse/SPARK-38579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38580) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yimin Yang resolved SPARK-38580. Resolution: Duplicate > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38580 > URL: https://issues.apache.org/jira/browse/SPARK-38580 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38580) Requesting Restful API can cause NullPointerException
Yimin Yang created SPARK-38580: -- Summary: Requesting Restful API can cause NullPointerException Key: SPARK-38580 URL: https://issues.apache.org/jira/browse/SPARK-38580 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0 Reporter: Yimin Yang When requesting the Restful API {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by this PR [https://github.com/apache/spark/pull/28208], it can cause a NullPointerException. The root cause is that, when calling doUpdate() of `LiveExecutionData`, `metricsValues` can be null. Then, when the statement `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38579) Requesting Restful API can cause NullPointerException
Yimin Yang created SPARK-38579: -- Summary: Requesting Restful API can cause NullPointerException Key: SPARK-38579 URL: https://issues.apache.org/jira/browse/SPARK-38579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0 Reporter: Yimin Yang When requesting the Restful API {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by this PR [https://github.com/apache/spark/pull/28208], it can cause a NullPointerException. The root cause is that, when calling doUpdate() of `LiveExecutionData`, `metricsValues` can be null. Then, when the statement `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
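A sketch of a possible guard at the endpoint, reusing the names from the description ({{exec}}, {{graph}}, {{printableMetrics}}); the actual fix may differ:

{code:java}
// metricValues stays null until the first metrics update arrives for a live
// execution, so check it before formatting instead of dereferencing blindly.
val metrics =
  if (exec.metricValues != null) {
    printableMetrics(graph.allNodes, exec.metricValues)
  } else {
    Seq.empty  // no metrics reported yet for this execution
  }
{code}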
[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507946#comment-17507946 ] Apache Spark commented on SPARK-38575: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35883 > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently we have to make some changes every time we cut a branch, as in > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38578) Avoid unnecessary sort in FileFormatWriter if user has specified sort
[ https://issues.apache.org/jira/browse/SPARK-38578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38578: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Improvement) > Avoid unnecessary sort in FileFormatWriter if user has specified sort > - > > Key: SPARK-38578 > URL: https://issues.apache.org/jira/browse/SPARK-38578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > FileFormatWriter will check and add an implicit sort for dynamic partition > columns or bucket columns according to the input physical plan. The check > now always fails under AQE since AdaptiveSparkPlanExec has no outputOrdering. > That causes a redundant sort if the user has specified a sort which satisfies > the required ordering (dynamic partition and bucket columns). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38578) Avoid unnecessary sort in FileFormatWriter if user has specified sort
XiDuo You created SPARK-38578: - Summary: Avoid unnecessary sort in FileFormatWriter if user has specified sort Key: SPARK-38578 URL: https://issues.apache.org/jira/browse/SPARK-38578 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You FileFormatWriter will check and add an implicit sort for dynamic partition columns or bucket columns according to the input physical plan. The check now always fails under AQE since AdaptiveSparkPlanExec has no outputOrdering. That causes a redundant sort if the user has specified a sort which satisfies the required ordering (dynamic partition and bucket columns). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
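A sketch of the scenario under stated assumptions (the DataFrame {{df}}, column name {{p}}, and output path are illustrative): the user's explicit sort already satisfies the required ordering, but because AQE wraps the plan in {{AdaptiveSparkPlanExec}}, whose outputOrdering is empty, FileFormatWriter cannot see it and adds a second sort:

{code:java}
// With AQE enabled, the query plan handed to FileFormatWriter is an
// AdaptiveSparkPlanExec reporting no outputOrdering, so the explicit
// sortWithinPartitions below is not recognized and the write inserts a
// redundant sort on the dynamic partition column.
spark.conf.set("spark.sql.adaptive.enabled", "true")
df.sortWithinPartitions("p")   // already satisfies the required ordering
  .write
  .partitionBy("p")            // dynamic partition column
  .parquet("/tmp/out")         // illustrative path
{code}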
[jira] [Assigned] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38575: Assignee: Hyukjin Kwon > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently we have to make some changes every time we cut a branch, as in > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38575. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35882 [https://github.com/apache/spark/pull/35882] > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently we have to make some changes every time we cut a branch, as in > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38568: Reporter: Yuming Wang (was: Dongjoon Hyun) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507942#comment-17507942 ] Yuming Wang commented on SPARK-38568: - OK > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38577) Interval types are not truncated to the expected endField when creating a DataFrame via Duration
[ https://issues.apache.org/jira/browse/SPARK-38577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38577: -- Description: *Problem:* ANSI interval types are stored as long internally. The long value is not truncated to the expected endField when creating a DataFrame via Duration. *Reproduce:* Create a "day to day" interval; the seconds are not truncated, see the code below. The internal long is not 86400 * 1000000, but (86400 + 1) * 1000000. {code:java} test("my test") { val data = Seq(Row(Duration.ofDays(1).plusSeconds(1))) val schema = StructType(Array( StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)) )) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.show() } {code} After debugging, the {{endField}} is always {{SECOND}} in {{durationToMicros}}, see below: {code:java} // IntervalUtils class def durationToMicros(duration: Duration): Long = { durationToMicros(duration, DT.SECOND) // always SECOND } def durationToMicros(duration: Duration, endField: Byte) {code} It seems a different endField should be used, which could be [DAY, HOUR, MINUTE, SECOND]. Or Spark can throw an exception to avoid truncating. was: *Problem:* ANSI interval types are stored as long internally. The long value is not truncated to the expected endField when creating a DataFrame via Duration. *Reproduce:* Create a "day to day" interval; the seconds are not truncated, see the code below. The internal long is not 86400 * 1000000, but (86400 + 1) * 1000000. {code:java} test("my test") { val data = Seq(Row(Duration.ofDays(1).plusSeconds(1))) val schema = StructType(Array( StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)) )) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.show() } {code} After debugging, the {{endField}} is always {{SECOND}} in {{durationToMicros}}, see below: {code:java} // IntervalUtils class def durationToMicros(duration: Duration): Long = { durationToMicros(duration, DT.SECOND) // always SECOND } def durationToMicros(duration: Duration, endField: Byte) {code} It seems a different endField should be used, which could be [DAY, HOUR, MINUTE, SECOND]. > Interval types are not truncated to the expected endField when creating a > DataFrame via Duration > > > Key: SPARK-38577 > URL: https://issues.apache.org/jira/browse/SPARK-38577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: Spark 3.3.0 snapshot version > >Reporter: chong >Priority: Major > > *Problem:* > ANSI interval types are stored as long internally. > The long value is not truncated to the expected endField when creating a > DataFrame via Duration. > > *Reproduce:* > Create a "day to day" interval; the seconds are not truncated, see the code below.
> The internal long is not 86400 * 1000000, but (86400 + 1) * 1000000. > > {code:java} > test("my test") { > val data = Seq(Row(Duration.ofDays(1).plusSeconds(1))) > val schema = StructType(Array( > StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, > DayTimeIntervalType.DAY)) > )) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), > schema) > df.show() > } {code} > > > After debugging, the {{endField}} is always {{SECOND}} in > {{durationToMicros}}, see below: > > {code:java} > // IntervalUtils class > def durationToMicros(duration: Duration): Long = { > durationToMicros(duration, DT.SECOND) // always SECOND > } > def durationToMicros(duration: Duration, endField: Byte) > {code} > It seems a different endField should be used, which could be [DAY, HOUR, MINUTE, SECOND]. > Or Spark can throw an exception to avoid truncating. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
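A sketch of the truncation the reporter suggests, assuming the DayTimeIntervalType field byte codes (DAY=0, HOUR=1, MINUTE=2, SECOND=3); this is illustrative, not the actual IntervalUtils implementation:

{code:java}
// Truncate a microsecond count down to the interval's end field, so that a
// DAY-to-DAY interval built from "1 day + 1 second" stores 86400 * 1000000
// micros rather than 86401 * 1000000.
val MICROS_PER_SECOND = 1000000L
val MICROS_PER_MINUTE = 60L * MICROS_PER_SECOND
val MICROS_PER_HOUR   = 60L * MICROS_PER_MINUTE
val MICROS_PER_DAY    = 24L * MICROS_PER_HOUR

def truncateToEndField(micros: Long, endField: Byte): Long = {
  val unit = endField match {
    case 0 => MICROS_PER_DAY     // DayTimeIntervalType.DAY
    case 1 => MICROS_PER_HOUR    // DayTimeIntervalType.HOUR
    case 2 => MICROS_PER_MINUTE  // DayTimeIntervalType.MINUTE
    case _ => 1L                 // DayTimeIntervalType.SECOND keeps micros
  }
  micros / unit * unit
}
{code}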
[jira] [Updated] (SPARK-38520) Overflow occurs when reading ANSI day time interval from CSV file
[ https://issues.apache.org/jira/browse/SPARK-38520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38520: -- Description: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code:java} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. was: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. > Overflow occurs when reading ANSI day time interval from CSV file > - > > Key: SPARK-38520 > URL: https://issues.apache.org/jira/browse/SPARK-38520 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chong >Priority: Major > > *Problem:* > Overflow occurs when reading the following positive intervals; the results > become negative:
[jira] [Updated] (SPARK-38520) Overflow occurs when reading ANSI day time interval from CSV file
[ https://issues.apache.org/jira/browse/SPARK-38520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38520: -- Description: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code:java} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. was: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code:java} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. > Overflow occurs when reading ANSI day time interval from CSV file > - > > Key: SPARK-38520 > URL: https://issues.apache.org/jira/browse/SPARK-38520 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chong >Priority: Major > > *Problem:* > Overflow occurs when reading the following positive intervals, the results > become negative:
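The wrap-around is plain Long overflow in the microsecond representation used for day-time intervals; a small sketch of the arithmetic, using the values from the report:

{code:java}
// Day-time intervals are stored as microseconds in a Long, which caps out at
// 106,751,991 whole days; one more day silently wraps to a negative value.
val microsPerDay = 24L * 60 * 60 * 1000 * 1000
println(Long.MaxValue / microsPerDay)  // 106751991
println(106751992L * microsPerDay)     // wraps to a large negative Long
{code}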
[jira] [Assigned] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38576: Assignee: (was: Apache Spark) > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38576: Assignee: Apache Spark > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507931#comment-17507931 ] Apache Spark commented on SPARK-38576: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35868 > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507930#comment-17507930 ] Apache Spark commented on SPARK-38576: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35868 > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38577) Interval types are not truncated to the expected endField when creating a DataFrame via Duration
chong created SPARK-38577: - Summary: Interval types are not truncated to the expected endField when creating a DataFrame via Duration Key: SPARK-38577 URL: https://issues.apache.org/jira/browse/SPARK-38577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Environment: Spark 3.3.0 snapshot version Reporter: chong *Problem:* ANSI interval types are stored as long internally. The long value is not truncated to the expected endField when creating a DataFrame via Duration. *Reproduce:* Create a "day to day" interval; the seconds are not truncated, see the code below. The internal long is not *86400 * 1000000* but *(86400 + 1) * 1000000*.
{code:java}
test("my test") {
  val data = Seq(Row(Duration.ofDays(1).plusSeconds(1)))
  val schema = StructType(Array(
    StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY))
  ))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  df.show()
}
{code}
After debugging, the {{endField}} is always {{SECOND}} in {{durationToMicros}}, see below:
{code:java}
// IntervalUtils class
def durationToMicros(duration: Duration): Long = {
  durationToMicros(duration, DT.SECOND) // always SECOND
}

def durationToMicros(duration: Duration, endField: Byte)
{code}
It seems a different endField should be used, one of [DAY, HOUR, MINUTE, SECOND]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
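A hedged sketch of the truncation the report asks for: compute the microseconds, then drop everything below the requested end field. The field-to-unit mapping below is illustrative (it mirrors DayTimeIntervalType's DAY/HOUR/MINUTE/SECOND constants) and is not Spark's actual implementation:
{code:java}
import java.time.Duration

// Hypothetical truncation sketch, not Spark's code. Field bytes mirror
// DayTimeIntervalType: DAY = 0, HOUR = 1, MINUTE = 2, SECOND = 3.
object TruncateToEndField {
  val MicrosPerSecond = 1000000L
  val MicrosPerMinute = 60L * MicrosPerSecond
  val MicrosPerHour = 60L * MicrosPerMinute
  val MicrosPerDay = 24L * MicrosPerHour

  def durationToMicros(d: Duration, endField: Byte): Long = {
    val micros = Math.addExact(
      Math.multiplyExact(d.getSeconds, MicrosPerSecond), d.getNano / 1000L)
    val unit = endField match {
      case 0 => MicrosPerDay
      case 1 => MicrosPerHour
      case 2 => MicrosPerMinute
      case _ => MicrosPerSecond
    }
    micros - Math.floorMod(micros, unit) // zero out everything below endField
  }
}

// Duration.ofDays(1).plusSeconds(1) truncated to DAY (0) yields exactly 86400 * 1000000L.
{code}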
[jira] [Created] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
Xinrong Meng created SPARK-38576: Summary: Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only Key: SPARK-38576 URL: https://issues.apache.org/jira/browse/SPARK-38576 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38556) Disable Pandas usage logging for method calls inside @contextmanager functions
[ https://issues.apache.org/jira/browse/SPARK-38556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38556. -- Target Version/s: 3.2.1, 3.3.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/35861 > Disable Pandas usage logging for method calls inside @contextmanager functions > -- > > Key: SPARK-38556 > URL: https://issues.apache.org/jira/browse/SPARK-38556 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Currently, calls inside @contextmanager functions are treated as external for > *with* statements. > For example, the below code records config.set_option calls inside > ps.option_context(...) > {code:java} > with ps.option_context("compute.ops_on_diff_frames", True): > pass {code} > We should disable usage logging for calls inside @contextmanager functions to > improve accuracy of the usage data > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38441. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35747 [https://github.com/apache/spark/pull/35747] > Support string and bool `regex` in `Series.replace` > --- > > Key: SPARK-38441 > URL: https://issues.apache.org/jira/browse/SPARK-38441 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support string and bool `regex` in `Series.replace` in order to reach parity > with pandas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38441: Assignee: Xinrong Meng > Support string and bool `regex` in `Series.replace` > --- > > Key: SPARK-38441 > URL: https://issues.apache.org/jira/browse/SPARK-38441 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Support string and bool `regex` in `Series.replace` in order to reach parity > with pandas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38572. --- Resolution: Fixed Issue resolved by pull request 35879 [https://github.com/apache/spark/pull/35879] > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507921#comment-17507921 ] Apache Spark commented on SPARK-38575: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35882 > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently we have to make some changes every time we cut a branch, like > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38575: Assignee: (was: Apache Spark) > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently we have to make some changes every time we cut a branch, like > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38575: Assignee: Apache Spark > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently we have to make some changes every time we cut a branch, like > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
Hyukjin Kwon created SPARK-38575: Summary: Deduplicate branch specification in GitHub Actions workflow Key: SPARK-38575 URL: https://issues.apache.org/jira/browse/SPARK-38575 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.4.0 Reporter: Hyukjin Kwon Currently we have to make some changes every time we cut a branch, like https://github.com/apache/spark/pull/35876. Ideally this should work automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38568: -- Affects Version/s: 3.4.0 (was: 3.3.0) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507907#comment-17507907 ] Dongjoon Hyun commented on SPARK-38568: --- Hi, [~yumwang]. When you clone an issue, you should change the `Reporter`. :) I didn't report this JIRA; the reporter should be you, since you created it. Could you fix it? > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources
[ https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507899#comment-17507899 ] Apache Spark commented on SPARK-36664: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/35881 > Log time spent waiting for cluster resources > > > Key: SPARK-36664 > URL: https://issues.apache.org/jira/browse/SPARK-36664 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Major > > To provide better visibility into why jobs might be running slowly, it would > be useful to log when we are waiting for cluster resources and for how long, > so that the user can be aware of any underlying cluster issue. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources
[ https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507900#comment-17507900 ] Apache Spark commented on SPARK-36664: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/35881 > Log time spent waiting for cluster resources > > > Key: SPARK-36664 > URL: https://issues.apache.org/jira/browse/SPARK-36664 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Major > > To provide better visibility into why jobs might be running slowly, it would > be useful to log when we are waiting for cluster resources and for how long, > so that the user can be aware of any underlying cluster issue. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38574) Enrich Avro data source documentation
[ https://issues.apache.org/jira/browse/SPARK-38574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38574: Assignee: Apache Spark > Enrich Avro data source documentation > - > > Key: SPARK-38574 > URL: https://issues.apache.org/jira/browse/SPARK-38574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Tianhan Hu >Assignee: Apache Spark >Priority: Minor > > Enrich Avro data source documentation to emphasize the difference between > *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter > of the function *from_avro*. > When using *from_avro*, the *avroSchema* option can be set to a compatible, > evolved schema, while *jsonFormatSchema* has to be the actual schema. > Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38574) Enrich Avro data source documentation
[ https://issues.apache.org/jira/browse/SPARK-38574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38574: Assignee: (was: Apache Spark) > Enrich Avro data source documentation > - > > Key: SPARK-38574 > URL: https://issues.apache.org/jira/browse/SPARK-38574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Tianhan Hu >Priority: Minor > > Enrich Avro data source documentation to emphasize the difference between > *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter > of the function *from_avro*. > When using *from_avro*, the *avroSchema* option can be set to a compatible, > evolved schema, while *jsonFormatSchema* has to be the actual schema. > Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38574) Enrich Avro data source documentation
[ https://issues.apache.org/jira/browse/SPARK-38574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507898#comment-17507898 ] Apache Spark commented on SPARK-38574: -- User 'tianhanhu' has created a pull request for this issue: https://github.com/apache/spark/pull/35880 > Enrich Avro data source documentation > - > > Key: SPARK-38574 > URL: https://issues.apache.org/jira/browse/SPARK-38574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Tianhan Hu >Priority: Minor > > Enrich Avro data source documentation to emphasize the difference between > *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter > of the function *from_avro*. > When using *from_avro*, the *avroSchema* option can be set to a compatible, > evolved schema, while *jsonFormatSchema* has to be the actual schema. > Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38574) Enrich Avro data source documentation
Tianhan Hu created SPARK-38574: -- Summary: Enrich Avro data source documentation Key: SPARK-38574 URL: https://issues.apache.org/jira/browse/SPARK-38574 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.1 Reporter: Tianhan Hu Enrich Avro data source documentation to emphasize the difference between *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter of the function *from_avro*. When using *from_avro*, the *avroSchema* option can be set to a compatible, evolved schema, while *jsonFormatSchema* has to be the actual schema. Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
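To make the distinction concrete, a hedged usage sketch: the schemas are made-up placeholders, and df is assumed to be a DataFrame with a binary Avro column named value.
{code:java}
import java.util.Collections
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Hypothetical writer (actual) and evolved reader schemas, for illustration.
val actualSchema =
  """{"type":"record","name":"r","fields":[{"name":"id","type":"long"}]}"""
val evolvedSchema =
  """{"type":"record","name":"r","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"tag","type":["null","string"],"default":null}]}""".stripMargin

val decoded = df.select(
  from_avro(
    col("value"),
    actualSchema,                                         // must be the actual schema
    Collections.singletonMap("avroSchema", evolvedSchema) // evolved, compatible schema
  ).as("event"))
{code}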
[jira] [Resolved] (SPARK-38555) Avoid contention and get or create clientPools quickly in the TransportClientFactory
[ https://issues.apache.org/jira/browse/SPARK-38555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-38555. - Assignee: weixiuli Resolution: Fixed > Avoid contention and get or create clientPools quickly in the > TransportClientFactory > - > > Key: SPARK-38555 > URL: https://issues.apache.org/jira/browse/SPARK-38555 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Assignee: weixiuli >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38555) Avoid contention and get or create clientPools quickly in the TransportClientFactory
[ https://issues.apache.org/jira/browse/SPARK-38555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-38555: Fix Version/s: 3.4.0 > Avoid contention and get or create clientPools quickly in the > TransportClientFactory > - > > Key: SPARK-38555 > URL: https://issues.apache.org/jira/browse/SPARK-38555 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38573) Support Partition Level Statistics Collection
Kazuyuki Tanimura created SPARK-38573: - Summary: Support Partition Level Statistics Collection Key: SPARK-38573 URL: https://issues.apache.org/jira/browse/SPARK-38573 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Kazuyuki Tanimura Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing the aggregated stats at table level for partitioned tables with the config spark.sql.statistics.size.autoUpdate.enabled. Supporting partition-level stats is useful for knowing which partitions are outliers (skewed partitions), and the query optimizer works better with partition-level stats in the case of partition pruning. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
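As a point of reference, partition-level stats can already be collected explicitly today; the ticket is about maintaining them automatically. A sketch with illustrative table and partition names, assuming an active SparkSession named spark:
{code:java}
// Explicit, per-partition statistics collection as it works today (sketch).
spark.sql("SET spark.sql.statistics.size.autoUpdate.enabled=true")
spark.sql("ANALYZE TABLE sales PARTITION (ds='2022-03-16') COMPUTE STATISTICS")
// Partition-level numFiles / totalSize / rowCount appear in the detailed output.
spark.sql("DESCRIBE EXTENDED sales PARTITION (ds='2022-03-16')").show(false)
{code}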
[jira] [Resolved] (SPARK-38545) Upgrade scala-maven-plugin from 4.4.0 to 4.5.6
[ https://issues.apache.org/jira/browse/SPARK-38545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38545. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35841 [https://github.com/apache/spark/pull/35841] > Upgrade scala-maven-plugin from 4.4.0 to 4.5.6 > -- > > Key: SPARK-38545 > URL: https://issues.apache.org/jira/browse/SPARK-38545 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > 4.5.6 upgrades the zinc dependency to 1.5.8 and cleans up some unnecessary > cascading dependencies > > https://github.com/davidB/scala-maven-plugin/compare/4.4.0...4.5.6 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38545) Upgrade scala-maven-plugin from 4.4.0 to 4.5.6
[ https://issues.apache.org/jira/browse/SPARK-38545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38545: Assignee: Yang Jie > Upgrade scala-maven-plugin from 4.4.0 to 4.5.6 > -- > > Key: SPARK-38545 > URL: https://issues.apache.org/jira/browse/SPARK-38545 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > 4.5.6 upgrades the zinc dependency to 1.5.8 and cleans up some unnecessary > cascading dependencies > > https://github.com/davidB/scala-maven-plugin/compare/4.4.0...4.5.6 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507808#comment-17507808 ] Apache Spark commented on SPARK-38572: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35879 > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507807#comment-17507807 ] Apache Spark commented on SPARK-38572: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35879 > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38572: Assignee: Max Gekk (was: Apache Spark) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38572: Assignee: Apache Spark (was: Max Gekk) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
[ https://issues.apache.org/jira/browse/SPARK-38561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-38561. -- Fix Version/s: 3.3.0 Assignee: Yikun Jiang Resolution: Fixed > Add doc for "Customized Kubernetes Schedulers" > -- > > Key: SPARK-38561 > URL: https://issues.apache.org/jira/browse/SPARK-38561 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38503) Add warning for getAdditionalPreKubernetesResources on executor side
[ https://issues.apache.org/jira/browse/SPARK-38503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-38503: - Target Version/s: 3.4.0 > Add warning for getAdditionalPreKubernetesResources on executor side > - > > Key: SPARK-38503 > URL: https://issues.apache.org/jira/browse/SPARK-38503 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507780#comment-17507780 ] Max Gekk commented on SPARK-38572: -- FYI, I am working on this. > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38572: - Reporter: Max Gekk (was: Dongjoon Hyun) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38572: - Fix Version/s: 3.4.0 (was: 3.3.0) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38572: - Affects Version/s: 3.4.0 (was: 3.3.0) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38572: Assignee: Max Gekk (was: Dongjoon Hyun) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
Max Gekk created SPARK-38572: Summary: Setting version to 3.4.0-SNAPSHOT Key: SPARK-38572 URL: https://issues.apache.org/jira/browse/SPARK-38572 Project: Spark Issue Type: Task Components: Build Affects Versions: 3.3.0 Reporter: Dongjoon Hyun Assignee: Dongjoon Hyun Fix For: 3.3.0 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.3.0 (was: 3.4.0) > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.4.0 (was: 3.3.0) > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.4.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-38194. --- Fix Version/s: 3.3.0 Assignee: Adam Binford Resolution: Fixed > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
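For illustration, a sketch of the container sizing this change makes tunable. The max(memory * factor, 384 MiB) shape and the 0.10 default come from the description above; the config name follows the linked change and should be verified against the released docs:
{code:java}
// Sketch of Yarn container sizing, not Spark's code: overhead defaults to
// max(memory * factor, 384 MiB), with the factor hard-coded to 0.10 before
// this change.
val executorMemoryMiB = 8192L
val overheadFactor = 0.18 // e.g. when more than 10% off-heap usage is observed
val overheadMiB = math.max((executorMemoryMiB * overheadFactor).toLong, 384L)
println(s"container request: ${executorMemoryMiB + overheadMiB} MiB")
// Assumed config name from the linked change:
// spark-submit ... --conf spark.executor.memoryOverheadFactor=0.18
{code}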
[jira] [Updated] (SPARK-38571) Week of month from a date is missing in spark3 for return values of 1 to 6
[ https://issues.apache.org/jira/browse/SPARK-38571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Appel updated SPARK-38571: Description: In Spark2 we could use the date_format function with either the W or F flags to compute the week of month from a date. These compute two different things: W has values from 1 to 6 and F has values from 1 to 5. Sample code and expected output:
``` python
df1 = spark.createDataFrame(
    [
        (1, date(2014, 3, 7)),
        (2, date(2014, 3, 8)),
        (3, date(2014, 3, 30)),
        (4, date(2014, 3, 31)),
        (5, date(2015, 3, 7)),
        (6, date(2015, 3, 8)),
        (7, date(2015, 3, 30)),
        (8, date(2015, 3, 31)),
    ],
    schema="a long, b date",
)
df1 = df1.withColumn("WEEKOFMONTH1-6", F.date_format(F.col("b"), "W"))
df1 = df1.withColumn("WEEKOFMONTH1-5", F.date_format(F.col("b"), "F"))
df1.show()
```
+---+----------+--------------+--------------+
|  a|         b|WEEKOFMONTH1-6|WEEKOFMONTH1-5|
+---+----------+--------------+--------------+
|  1|2014-03-07|             2|             1|
|  2|2014-03-08|             2|             2|
|  3|2014-03-30|             6|             5|
|  4|2014-03-31|             6|             5|
|  5|2015-03-07|             1|             1|
|  6|2015-03-08|             2|             2|
|  7|2015-03-30|             5|             5|
|  8|2015-03-31|             5|             5|
+---+----------+--------------+--------------+
With Spark3 having spark.sql.legacy.timeParserPolicy set to EXCEPTION by default, this throws an error:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: W, Please use the SQL function EXTRACT instead
However, the EXTRACT function has nothing available that extracts the week of month for the values 1 to 6. The Spark3 docs mention we can define our own patterns, located at [https://spark.apache.org/docs/3.2.1/sql-ref-datetime-pattern.html], which are implemented via DateTimeFormatter under the hood: [https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] That site lists both W and F for week of month:
W week-of-month number 4
F week-of-month number 3
However, only F is implemented on the datetime pattern reference. Is there another way we can compute this week of month for values 1 to 6 while still using the built-ins with Spark3? Currently we have to set spark.sql.legacy.timeParserPolicy to LEGACY in order to run this. Thank you, Kevin
[jira] [Created] (SPARK-38571) Week of month from a date is missing in spark3 for return values of 1 to 6
Kevin Appel created SPARK-38571: --- Summary: Week of month from a date is missing in spark3 for return values of 1 to 6 Key: SPARK-38571 URL: https://issues.apache.org/jira/browse/SPARK-38571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: Kevin Appel In Spark2 we could use the date_format function with either the W or F flags to compute the week of month from a date. These compute two different things: W has values from 1 to 6 and F has values from 1 to 5. Sample code and expected output:
``` python
df1 = spark.createDataFrame(
    [
        (1, date(2014, 3, 7)),
        (2, date(2014, 3, 8)),
        (3, date(2014, 3, 30)),
        (4, date(2014, 3, 31)),
        (5, date(2015, 3, 7)),
        (6, date(2015, 3, 8)),
        (7, date(2015, 3, 30)),
        (8, date(2015, 3, 31)),
    ],
    schema="a long, b date",
)
df1 = df1.withColumn("WEEKOFMONTH1-6", F.date_format(F.col("b"), "W"))
df1 = df1.withColumn("WEEKOFMONTH1-5", F.date_format(F.col("b"), "F"))
df1.show()
```
+---+----------+--------------+--------------+
|  a|         b|WEEKOFMONTH1-6|WEEKOFMONTH1-5|
+---+----------+--------------+--------------+
|  1|2014-03-07|             2|             1|
|  2|2014-03-08|             2|             2|
|  3|2014-03-30|             6|             5|
|  4|2014-03-31|             6|             5|
|  5|2015-03-07|             1|             1|
|  6|2015-03-08|             2|             2|
|  7|2015-03-30|             5|             5|
|  8|2015-03-31|             5|             5|
+---+----------+--------------+--------------+
With Spark3 having spark.sql.legacy.timeParserPolicy set to EXCEPTION by default, this throws an error:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: W, Please use the SQL function EXTRACT instead
However, the EXTRACT function has nothing available that extracts the week of month for the values 1 to 6. The Spark3 docs mention we can define our own patterns, located at [https://spark.apache.org/docs/3.2.1/sql-ref-datetime-pattern.html], which are implemented via DateTimeFormatter under the hood: [https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] That site lists both W and F for week of month:
W week-of-month number 4
F week-of-month number 3
However, only F is implemented on the datetime pattern reference. Is there another way we can compute this week of month for values 1 to 6 while still using the built-ins with Spark3? Currently we have to set spark.sql.legacy.timeParserPolicy to LEGACY in order to run this. Thank you, Kevin -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
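One possible non-legacy workaround, sketched in Scala from the example above: derive the 1-to-6 value from the day of month plus the weekday offset of the month's first day. The column name b comes from the reproduction; Sunday-start weeks are assumed from the sample output, so verify the locale semantics before relying on it:
{code:java}
import org.apache.spark.sql.functions._

// Weekday of the first day of the month; dayofweek returns 1 = Sunday ... 7 = Saturday.
val firstDowOffset = dayofweek(trunc(col("b"), "MM")) - lit(1)

// Week of month in 1..6. On the sample data this reproduces the W column:
// 2014-03-07 -> 2, 2014-03-30 -> 6, 2015-03-07 -> 1, 2015-03-30 -> 5.
val weekOfMonth1to6 =
  floor((dayofmonth(col("b")) + firstDowOffset - lit(1)) / 7) + lit(1)

// Usage: df1.withColumn("WEEKOFMONTH1-6", weekOfMonth1to6)
{code}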
[jira] [Commented] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507626#comment-17507626 ] Apache Spark commented on SPARK-38570: -- User 'mcdull-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/35878 > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38570: Assignee: Apache Spark > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: Apache Spark >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38570: Assignee: (was: Apache Spark) > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-38570: - Description: The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
mcdull_zhang created SPARK-38570: Summary: Incorrect DynamicPartitionPruning caused by Literal Key: SPARK-38570 URL: https://issues.apache.org/jira/browse/SPARK-38570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
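The pitfall is easy to see with plain Scala sets: an empty set is a subset of every set, so an expression with no references (a Literal) passes the subsetOf test. A hedged sketch of the extra guard, illustrative only and not the merged fix:
{code:java}
// The bug in miniature with plain Scala sets: an empty reference set is a
// subset of any attribute set, so a Literal slips through the partition test.
val partitionColumns = Set("ds")
val literalRefs = Set.empty[String] // Literal.references is empty

println(literalRefs.subsetOf(partitionColumns)) // true -> wrongly accepted

// Requiring at least one reference closes the hole (sketch only):
def isPartitionFilter(refs: Set[String]): Boolean =
  refs.nonEmpty && refs.subsetOf(partitionColumns)

println(isPartitionFilter(literalRefs)) // false
println(isPartitionFilter(Set("ds")))   // true
{code}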
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507585#comment-17507585 ] Apache Spark commented on SPARK-38568: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35877 > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38568: Assignee: Apache Spark > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38568: Assignee: (was: Apache Spark) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507584#comment-17507584 ] Apache Spark commented on SPARK-38568: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35877 > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38569) external top-level directory is problematic for bazel
[ https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507583#comment-17507583 ] Apache Spark commented on SPARK-38569: -- User 'alkis' has created a pull request for this issue: https://github.com/apache/spark/pull/35874 > external top-level directory is problematic for bazel > - > > Key: SPARK-38569 > URL: https://issues.apache.org/jira/browse/SPARK-38569 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: Alkis Evlogimenos >Priority: Minor > Labels: build > > {{external}} is a hardwired special name for top-level directories for > [bazel|https://bazel.build/]. This causes all sorts of issues with both > native/basic bazel and extensions like > [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. > Spark forks using bazel to build Spark have to jump through hoops to make > things work, if at all. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38569) external top-level directory is problematic for bazel
[ https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38569: Assignee: Apache Spark > external top-level directory is problematic for bazel > - > > Key: SPARK-38569 > URL: https://issues.apache.org/jira/browse/SPARK-38569 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: Alkis Evlogimenos >Assignee: Apache Spark >Priority: Minor > Labels: build > > {{external}} is a hardwired special name for top-level directories in > [bazel|https://bazel.build/]. This causes all sorts of issues with both > native/basic bazel and extensions like > [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. > Spark forks that use bazel to build Spark have to jump through hoops to make > things work, if they work at all.
[jira] [Assigned] (SPARK-38569) external top-level directory is problematic for bazel
[ https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38569: Assignee: (was: Apache Spark) > external top-level directory is problematic for bazel > - > > Key: SPARK-38569 > URL: https://issues.apache.org/jira/browse/SPARK-38569 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: Alkis Evlogimenos >Priority: Minor > Labels: build > > {{external}} is a hardwired special name for top-level directories in > [bazel|https://bazel.build/]. This causes all sorts of issues with both > native/basic bazel and extensions like > [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. > Spark forks that use bazel to build Spark have to jump through hoops to make > things work, if they work at all.
[jira] [Created] (SPARK-38569) external top-level directory is problematic for bazel
Alkis Evlogimenos created SPARK-38569: - Summary: external top-level directory is problematic for bazel Key: SPARK-38569 URL: https://issues.apache.org/jira/browse/SPARK-38569 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.1 Reporter: Alkis Evlogimenos {{external}} is a hardwired special name for top-level directories in [bazel|https://bazel.build/]. This causes all sorts of issues with both native/basic bazel and extensions like [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. Spark forks that use bazel to build Spark have to jump through hoops to make things work, if they work at all.
[jira] [Created] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
Yuming Wang created SPARK-38568: --- Summary: Upgrade ZSTD-JNI to 1.5.2-2 Key: SPARK-38568 URL: https://issues.apache.org/jira/browse/SPARK-38568 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Dongjoon Hyun Assignee: Dongjoon Hyun Fix For: 3.3.0
[jira] [Updated] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38568: Fix Version/s: (was: 3.3.0) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-38568: --- Assignee: (was: Dongjoon Hyun) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507557#comment-17507557 ] Steve Loughran commented on SPARK-38330: Sorry about that. Try enabling path-style access and see if that helps. > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. >Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 > led us to the following exception when reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > 
com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) > at > com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.e
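For readers hitting the same SSLPeerUnverifiedException: the failure is typical of bucket names containing dots, where the virtual-hosted hostname (bucket.name.s3.amazonaws.com) no longer matches the wildcard certificate *.s3.amazonaws.com. A minimal sketch of the path-style workaround suggested above, using the Hadoop S3A setting fs.s3a.path.style.access (forwarded by Spark via the spark.hadoop. prefix); the bucket and path names are hypothetical:
{code:java}
// Sketch: with path-style access, S3A issues requests to
// s3.amazonaws.com/<bucket>/... instead of <bucket>.s3.amazonaws.com/...,
// sidestepping the subject-alternative-name mismatch.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-path-style-access")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

// Hypothetical bucket with dots in its name, the usual trigger of the error.
val df = spark.read.parquet("s3a://my.dotted.bucket/data/table.parquet")
{code}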
[jira] [Assigned] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38567: Assignee: Max Gekk > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Max Gekk >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Resolved] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38567. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35876 [https://github.com/apache/spark/pull/35876] > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Assigned] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38567: Assignee: Apache Spark > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Assigned] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38567: Assignee: (was: Apache Spark) > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Commented] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507532#comment-17507532 ] Apache Spark commented on SPARK-38567: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35876 > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Commented] (SPARK-38108) Use error classes in the compilation errors of UDF/UDAF
[ https://issues.apache.org/jira/browse/SPARK-38108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507528#comment-17507528 ] huangtengfei commented on SPARK-38108: -- I am working on this. Thanks [~maxgekk] > Use error classes in the compilation errors of UDF/UDAF > --- > > Key: SPARK-38108 > URL: https://issues.apache.org/jira/browse/SPARK-38108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * noHandlerForUDAFError > * unexpectedEvalTypesForUDFsError > * usingUntypedScalaUDFError > * udfClassDoesNotImplementAnyUDFInterfaceError > * udfClassNotAllowedToImplementMultiUDFInterfacesError > * udfClassWithTooManyTypeArgumentsError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite.
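As a rough sketch of what one such migration could look like (the error class name, message text, and constructor shape below are illustrative assumptions, not the final PR): the helper in QueryCompilationErrors stops building an ad-hoc message and instead throws an AnalysisException tied to an entry in error-classes.json, which the QueryCompilationErrorsSuite test can then assert on by error class rather than by message text:
{code:java}
import org.apache.spark.sql.AnalysisException

// Hypothetical error-classes.json entry:
//   "UNTYPED_SCALA_UDF" : {
//     "message" : [ "You're using untyped Scala UDF, which does not have the input type information." ]
//   }

// Sketch of the migrated helper: the thrown AnalysisException implements
// SparkThrowable and carries the error class instead of a free-form message.
def usingUntypedScalaUDFError(): Throwable = {
  new AnalysisException(
    errorClass = "UNTYPED_SCALA_UDF",
    messageParameters = Array.empty)
}
{code}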