[jira] [Updated] (SPARK-38584) Unify the data validation
[ https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-38584: - Description: 1, input vector validation is missing in most algorithms; when the input dataset contains some invalid values (NaN/Infinity), then: * the training may run successfully and return a model with invalid coefficients, like LinearSVC * the training will fail with an irrelevant message, like KMeans {code:java} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ import org.apache.spark.ml.clustering._ val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF() val svc = new LinearSVC() val model = svc.fit(df) scala> model.intercept res0: Double = NaN scala> model.coefficients res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] val km = new KMeans().setK(2) scala> km.fit(df) 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113) java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity at scala.Predef$.require(Predef.scala:281) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) {code} 2, related methods to validate the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, org.apache.spark.ml.util.MetadataUtils, etc. I think it is time to unify related methods into one source file. was: 1, input vector validation is missing in most algorithms; when the input dataset contains some invalid values (NaN/Infinity), then: * the training may run successfully with an invalid model, like LinearSVC * the training will fail with an irrelevant message, like KMeans {code:java} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ import org.apache.spark.ml.clustering._ val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF() val svc = new LinearSVC() val model = svc.fit(df) scala> model.intercept res0: Double = NaN scala> model.coefficients res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] val km = new KMeans().setK(2) scala> km.fit(df) 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113) java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity at scala.Predef$.require(Predef.scala:281) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) {code} 2, related methods to validate the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, org.apache.spark.ml.util.MetadataUtils, etc. I think it is time to unify related methods into one source file.
> Unify the data validation > - > > Key: SPARK-38584 > URL: https://issues.apache.org/jira/browse/SPARK-38584 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Major > > 1, input vector validation is missing in most algorithms; when the input > dataset contains some invalid values (NaN/Infinity), then: > * the training may run successfully and return a model with invalid > coefficients, like LinearSVC > * the training will fail with an irrelevant message, like KMeans > > {code:java} > import org.apache.spark.ml.feature._ > import org.apache.spark.ml.linalg._ > import org.apache.spark.ml.classification._ > import org.apache.spark.ml.clustering._ > val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, > Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, > 2.0)))).toDF() > val svc = new LinearSVC() > val model = svc.fit(df) > scala> model.intercept > res0: Double = NaN > scala> model.coefficients > res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] > val km = new KMeans().setK(2) > scala> km.fit(df) > 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID > 113) > java.lang.IllegalArgumentException: requirement failed: Both norms should be > greater or equal to 0.0, found norm1=NaN, norm2=Infinity > at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) > {code} > > 2, related methods to validate the input dataset (like labels/weights) exist in > {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, > org.apache.spark.ml.util.MetadataUtils, etc. > > I think it is time to unify related methods into one source file.
[jira] [Created] (SPARK-38584) Unify the data validation
zhengruifeng created SPARK-38584: Summary: Unify the data validation Key: SPARK-38584 URL: https://issues.apache.org/jira/browse/SPARK-38584 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.4.0 Reporter: zhengruifeng 1, input vector validation is missing in most algorithms; when the input dataset contains some invalid values (NaN/Infinity), then: * the training may run successfully with an invalid model, like LinearSVC * the training will fail with an irrelevant message, like KMeans {code:java} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.classification._ import org.apache.spark.ml.clustering._ val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF() val svc = new LinearSVC() val model = svc.fit(df) scala> model.intercept res0: Double = NaN scala> model.coefficients res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] val km = new KMeans().setK(2) scala> km.fit(df) 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113) java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity at scala.Predef$.require(Predef.scala:281) at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543) {code} 2, related methods to validate the input dataset (like labels/weights) exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils, org.apache.spark.ml.util.MetadataUtils, etc. I think it is time to unify related methods into one source file. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
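A minimal sketch of what a unified check could look like, built on the public {{vector_to_array}} helper from {{org.apache.spark.ml.functions}}; the function name and error message here are illustrative, not an actual Spark API:

{code:java}
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, exists, isnan}

// Hypothetical unified check: fail fast at fit() time if the features column
// contains NaN or +/-Infinity, instead of silently producing a NaN model
// (LinearSVC) or failing with an unrelated error deep inside the algorithm
// (KMeans).
def validateVectors(dataset: DataFrame, featuresCol: String): DataFrame = {
  val invalidCount = dataset
    .filter(exists(vector_to_array(col(featuresCol)),
      v => isnan(v) || v === Double.PositiveInfinity || v === Double.NegativeInfinity))
    .limit(1)
    .count()
  require(invalidCount == 0,
    s"Vector column '$featuresCol' must not contain NaN/Infinity values.")
  dataset
}
{code}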
[jira] [Assigned] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38582: Assignee: (was: Apache Spark) > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507996#comment-17507996 ] Apache Spark commented on SPARK-38582: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/35886 > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507995#comment-17507995 ] Apache Spark commented on SPARK-38582: -- User 'dcoliversun' has created a pull request for this issue: https://github.com/apache/spark/pull/35886 > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38582: Assignee: Apache Spark > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: qian >Assignee: Apache Spark >Priority: Minor > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38566) Revert the parser changes for DEFAULT column support
[ https://issues.apache.org/jira/browse/SPARK-38566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507994#comment-17507994 ] Apache Spark commented on SPARK-38566: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35885 > Revert the parser changes for DEFAULT column support > > > Key: SPARK-38566 > URL: https://issues.apache.org/jira/browse/SPARK-38566 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Blocker > > Revert the commit > https://github.com/apache/spark/commit/e21cb62d02c85a66771822cdd49c49dbb3e44502 > from branch-3.3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38566) Revert the parser changes for DEFAULT column support
[ https://issues.apache.org/jira/browse/SPARK-38566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507993#comment-17507993 ] Apache Spark commented on SPARK-38566: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35885 > Revert the parser changes for DEFAULT column support > > > Key: SPARK-38566 > URL: https://issues.apache.org/jira/browse/SPARK-38566 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Blocker > > Revert the commit > https://github.com/apache/spark/commit/e21cb62d02c85a66771822cdd49c49dbb3e44502 > from branch-3.3. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38583) to_timestamp should allow numeric types
Hyukjin Kwon created SPARK-38583: Summary: to_timestamp should allow numeric types Key: SPARK-38583 URL: https://issues.apache.org/jira/browse/SPARK-38583 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should allow them back: {code} spark.range(1).selectExpr("to_timestamp(id)").show() {code} **Before** {code} +-------------------+ |   to_timestamp(id)| +-------------------+ |1970-01-01 09:00:00| +-------------------+ {code} **After** {code} +----------------+ |to_timestamp(id)| +----------------+ |            null| +----------------+ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38583) to_timestamp should allow numeric types
[ https://issues.apache.org/jira/browse/SPARK-38583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-38583: - Description: SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should allow them back: {code} spark.range(1).selectExpr("to_timestamp(id)").show() {code} *Before* {code} +-------------------+ |   to_timestamp(id)| +-------------------+ |1970-01-01 09:00:00| +-------------------+ {code} *After* {code} +----------------+ |to_timestamp(id)| +----------------+ |            null| +----------------+ {code} was: SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should allow them back: {code} spark.range(1).selectExpr("to_timestamp(id)").show() {code} **Before** {code} +-------------------+ |   to_timestamp(id)| +-------------------+ |1970-01-01 09:00:00| +-------------------+ {code} **After** {code} +----------------+ |to_timestamp(id)| +----------------+ |            null| +----------------+ {code} > to_timestamp should allow numeric types > --- > > Key: SPARK-38583 > URL: https://issues.apache.org/jira/browse/SPARK-38583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-38240 mistakenly disallowed numeric types in to_timestamp. We should > allow them back: > {code} > spark.range(1).selectExpr("to_timestamp(id)").show() > {code} > *Before* > {code} > +-------------------+ > |   to_timestamp(id)| > +-------------------+ > |1970-01-01 09:00:00| > +-------------------+ > {code} > *After* > {code} > +----------------+ > |to_timestamp(id)| > +----------------+ > |            null| > +----------------+ > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
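Until the behavior is restored, an explicit cast reproduces the pre-SPARK-38240 semantics; a minimal sketch, assuming the usual interpretation of a numeric value as seconds since the epoch:

{code:java}
// Regression: to_timestamp(id) currently returns null for numeric input.
spark.range(1).selectExpr("to_timestamp(id)").show()

// Workaround sketch: CAST interprets a numeric value as seconds since the
// epoch, matching what to_timestamp(id) returned before SPARK-38240.
spark.range(1).selectExpr("CAST(id AS TIMESTAMP)").show()
{code}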
[jira] [Created] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
qian created SPARK-38582: Summary: Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern Key: SPARK-38582 URL: https://issues.apache.org/jira/browse/SPARK-38582 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.1 Reporter: qian There are many duplicate code patterns in Spark Code: {code:java} new EnvVarBuilder() .withName(key) .withValue(value) .build() {code} {code:java} new EnvVarBuilder() .withName(name) .withValueFrom(new EnvVarSourceBuilder() .withNewFieldRef(version, field) .build()) .build() {code} [The assignment statement for executor envVar | https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] has 63 lines. We could introduce _buildEnvVarsWithKV_ and _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
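A sketch of what the proposed helpers could look like; the signatures are assumptions derived from the duplicated patterns above, not the final {{KubernetesUtils}} API:

{code:java}
import io.fabric8.kubernetes.api.model.{EnvVar, EnvVarBuilder, EnvVarSourceBuilder}

// Build plain key/value environment variables.
def buildEnvVarsWithKV(kvs: Seq[(String, String)]): Seq[EnvVar] =
  kvs.map { case (key, value) =>
    new EnvVarBuilder()
      .withName(key)
      .withValue(value)
      .build()
  }

// Build environment variables whose values come from pod field references,
// e.g. status.podIP or metadata.name.
def buildEnvVarsWithFieldRef(refs: Seq[(String, String, String)]): Seq[EnvVar] =
  refs.map { case (name, apiVersion, fieldPath) =>
    new EnvVarBuilder()
      .withName(name)
      .withValueFrom(new EnvVarSourceBuilder()
        .withNewFieldRef(apiVersion, fieldPath)
        .build())
      .build()
  }
{code}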
[jira] [Resolved] (SPARK-38557) What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this?
[ https://issues.apache.org/jira/browse/SPARK-38557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38557. -- Resolution: Invalid > What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData > and how to fix or work around this? > > > Key: SPARK-38557 > URL: https://issues.apache.org/jira/browse/SPARK-38557 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 > AWS EMR 6.3.0 > python 3.7.2 >Reporter: Dmitry Goldenberg >Priority: Major > > I'm seeing errors such as the one below when executing a structured Spark > Streaming app which streams data from AWS Kinesis. > > I've googled the error but can't tell what may be the cause. Is Spark running > out of disk space? Something else? > {code:java} > // From the stderr log in EMR > 22/03/15 00:54:00 WARN HDFSMetadataCommitter: Error while fetching MetaData > [attempt = 1] > java.lang.IllegalStateException: > hdfs://ip-10-2-XXX-XXX.awsinternal.acme.com:8020/mnt/tmp/temporary-03b8fecf-32d5-422c-9375-4c3450ed0bb8/sources/0/shard-commit/0 > does not exist > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.$anonfun$get$1(HDFSMetadataCommitter.scala:163) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.withRetry(HDFSMetadataCommitter.scala:229) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.get(HDFSMetadataCommitter.scala:151) > at > org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:275) > at > org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:163) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:399) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:399) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:211) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at >
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244){code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38557) What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData and how to fix or work around this?
[ https://issues.apache.org/jira/browse/SPARK-38557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507976#comment-17507976 ] Hyukjin Kwon commented on SPARK-38557: -- For questions, it is better to interact with the Spark mailing list. Let's file an issue after having a discussion there first. > What may be a cause for HDFSMetadataCommitter: Error while fetching MetaData > and how to fix or work around this? > > > Key: SPARK-38557 > URL: https://issues.apache.org/jira/browse/SPARK-38557 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 > AWS EMR 6.3.0 > python 3.7.2 >Reporter: Dmitry Goldenberg >Priority: Major > > I'm seeing errors such as the one below when executing a structured Spark > Streaming app which streams data from AWS Kinesis. > > I've googled the error but can't tell what may be the cause. Is Spark running > out of disk space? Something else? > {code:java} > // From the stderr log in EMR > 22/03/15 00:54:00 WARN HDFSMetadataCommitter: Error while fetching MetaData > [attempt = 1] > java.lang.IllegalStateException: > hdfs://ip-10-2-XXX-XXX.awsinternal.acme.com:8020/mnt/tmp/temporary-03b8fecf-32d5-422c-9375-4c3450ed0bb8/sources/0/shard-commit/0 > does not exist > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.$anonfun$get$1(HDFSMetadataCommitter.scala:163) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.withRetry(HDFSMetadataCommitter.scala:229) > at > org.apache.spark.sql.kinesis.HDFSMetadataCommitter.get(HDFSMetadataCommitter.scala:151) > at > org.apache.spark.sql.kinesis.KinesisSource.prevBatchShardInfo(KinesisSource.scala:275) > at > org.apache.spark.sql.kinesis.KinesisSource.getOffset(KinesisSource.scala:163) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$6(MicroBatchExecution.scala:399) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$2(MicroBatchExecution.scala:399) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.immutable.Map$Map1.foreach(Map.scala:128) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$constructNextBatch$1(MicroBatchExecution.scala:382) > at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:613) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.constructNextBatch(MicroBatchExecution.scala:378) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:211) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357) > at >
org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:333) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244){code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38578) Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE
[ https://issues.apache.org/jira/browse/SPARK-38578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38578: -- Summary: Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE (was: Avoid unnecessary sort in FileFormatWriter if user has specified sort) > Avoid unnecessary sort in FileFormatWriter if user has specified sort in AQE > > > Key: SPARK-38578 > URL: https://issues.apache.org/jira/browse/SPARK-38578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > FileFormatWriter will check and add an implicit sort for dynamic partition > columns or bucket columns according to the input physical plan. The check > now always fails under AQE since AdaptiveSparkPlanExec has no outputOrdering. > That causes a redundant sort if the user has specified a sort which satisfies > the required ordering (dynamic partition and bucket columns). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38581) List of supported pandas APIs for pandas API on Spark docs.
[ https://issues.apache.org/jira/browse/SPARK-38581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-38581: Issue Type: Documentation (was: Bug) > List of supported pandas APIs for pandas API on Spark docs. > --- > > Key: SPARK-38581 > URL: https://issues.apache.org/jira/browse/SPARK-38581 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > In Modin, they have a [list of supported pandas > APIs|https://modin.readthedocs.io/en/stable/supported_apis/dataframe_supported.html]. > It would be great if we also had a list of supported pandas APIs so that users > can easily find which APIs are available or not in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38581) List of supported pandas APIs for pandas API on Spark docs.
[ https://issues.apache.org/jira/browse/SPARK-38581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507962#comment-17507962 ] Haejoon Lee commented on SPARK-38581: - I'm working on it > List of supported pandas APIs for pandas API on Spark docs. > --- > > Key: SPARK-38581 > URL: https://issues.apache.org/jira/browse/SPARK-38581 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > In Modin, they have a [list of supported pandas > APIs|https://modin.readthedocs.io/en/stable/supported_apis/dataframe_supported.html]. > It would be great if we also had a list of supported pandas APIs so that users > can easily find which APIs are available or not in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38581) List of supported pandas APIs for pandas API on Spark docs.
Haejoon Lee created SPARK-38581: --- Summary: List of supported pandas APIs for pandas API on Spark docs. Key: SPARK-38581 URL: https://issues.apache.org/jira/browse/SPARK-38581 Project: Spark Issue Type: Bug Components: Documentation, PySpark Affects Versions: 3.3.0 Reporter: Haejoon Lee In Modin, they have a [list of supported pandas APIs|https://modin.readthedocs.io/en/stable/supported_apis/dataframe_supported.html]. It would be great if we also had a list of supported pandas APIs so that users can easily find which APIs are available or not in pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38579) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507958#comment-17507958 ] Apache Spark commented on SPARK-38579: -- User 'yym1995' has created a pull request for this issue: https://github.com/apache/spark/pull/35884 > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38579 > URL: https://issues.apache.org/jira/browse/SPARK-38579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38579) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38579: Assignee: Apache Spark > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38579 > URL: https://issues.apache.org/jira/browse/SPARK-38579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Assignee: Apache Spark >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38579) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38579: Assignee: (was: Apache Spark) > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38579 > URL: https://issues.apache.org/jira/browse/SPARK-38579 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38580) Requesting Restful API can cause NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-38580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yimin Yang resolved SPARK-38580. Resolution: Duplicate > Requesting Restful API can cause NullPointerException > - > > Key: SPARK-38580 > URL: https://issues.apache.org/jira/browse/SPARK-38580 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.2.1 >Reporter: Yimin Yang >Priority: Major > > When requesting the Restful API > {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by > this PR [https://github.com/apache/spark/pull/28208], it can cause a > NullPointerException. The root cause is that, when calling doUpdate() of > `LiveExecutionData`, `metricsValues` can be null. Then, when the statement > `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a > NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38580) Requesting Restful API can cause NullPointerException
Yimin Yang created SPARK-38580: -- Summary: Requesting Restful API can cause NullPointerException Key: SPARK-38580 URL: https://issues.apache.org/jira/browse/SPARK-38580 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0 Reporter: Yimin Yang When requesting the Restful API {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by this PR [https://github.com/apache/spark/pull/28208], it can cause a NullPointerException. The root cause is that, when calling doUpdate() of `LiveExecutionData`, `metricsValues` can be null. Then, when the statement `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38579) Requesting Restful API can cause NullPointerException
Yimin Yang created SPARK-38579: -- Summary: Requesting Restful API can cause NullPointerException Key: SPARK-38579 URL: https://issues.apache.org/jira/browse/SPARK-38579 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0 Reporter: Yimin Yang When requesting the Restful API {baseURL}/api/v1/applications/$appId/sql/$executionId, which was introduced by this PR [https://github.com/apache/spark/pull/28208], it can cause a NullPointerException. The root cause is that, when calling doUpdate() of `LiveExecutionData`, `metricsValues` can be null. Then, when the statement `printableMetrics(graph.allNodes, exec.metricValues)` is executed, it throws a NullPointerException. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
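A sketch of a possible guard at the endpoint, reusing the names from the description ({{exec}}, {{graph}}, {{printableMetrics}}); the actual fix may differ:

{code:java}
// metricValues stays null until the first metrics update arrives for a live
// execution, so check it before formatting instead of dereferencing blindly.
val metrics =
  if (exec.metricValues != null) {
    printableMetrics(graph.allNodes, exec.metricValues)
  } else {
    Seq.empty  // no metrics reported yet for this execution
  }
{code}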
[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507946#comment-17507946 ] Apache Spark commented on SPARK-38575: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35883 > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently we have to make some changes every time we cut a branch, as in > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38578) Avoid unnecessary sort in FileFormatWriter if user has specified sort
[ https://issues.apache.org/jira/browse/SPARK-38578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-38578: -- Parent: SPARK-37063 Issue Type: Sub-task (was: Improvement) > Avoid unnecessary sort in FileFormatWriter if user has specified sort > - > > Key: SPARK-38578 > URL: https://issues.apache.org/jira/browse/SPARK-38578 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > > FileFormatWriter will check and add an implicit sort for dynamic partition > columns or bucket columns according to the input physical plan. The check > now always fails under AQE since AdaptiveSparkPlanExec has no outputOrdering. > That causes a redundant sort if the user has specified a sort which satisfies > the required ordering (dynamic partition and bucket columns). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38578) Avoid unnecessary sort in FileFormatWriter if user has specified sort
XiDuo You created SPARK-38578: - Summary: Avoid unnecessary sort in FileFormatWriter if user has specified sort Key: SPARK-38578 URL: https://issues.apache.org/jira/browse/SPARK-38578 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You FileFormatWriter will check and add an implicit sort for dynamic partition columns or bucket columns according to the input physical plan. The check now always fails under AQE since AdaptiveSparkPlanExec has no outputOrdering. That causes a redundant sort if the user has specified a sort which satisfies the required ordering (dynamic partition and bucket columns). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
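A sketch of the scenario under stated assumptions (the DataFrame {{df}}, column name {{p}}, and output path are illustrative): the user's explicit sort already satisfies the required ordering, but because AQE wraps the plan in {{AdaptiveSparkPlanExec}}, whose outputOrdering is empty, FileFormatWriter cannot see it and adds a second sort:

{code:java}
// With AQE enabled, the query plan handed to FileFormatWriter is an
// AdaptiveSparkPlanExec reporting no outputOrdering, so the explicit
// sortWithinPartitions below is not recognized and the write inserts a
// redundant sort on the dynamic partition column.
spark.conf.set("spark.sql.adaptive.enabled", "true")
df.sortWithinPartitions("p")   // already satisfies the required ordering
  .write
  .partitionBy("p")            // dynamic partition column
  .parquet("/tmp/out")         // illustrative path
{code}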
[jira] [Assigned] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38575: Assignee: Hyukjin Kwon > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently we have to make some changes every time we cut a branch, as in > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38575. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35882 [https://github.com/apache/spark/pull/35882] > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Currently we have to make some changes every time we cut a branch, as in > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38568: Reporter: Yuming Wang (was: Dongjoon Hyun) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507942#comment-17507942 ] Yuming Wang commented on SPARK-38568: - OK > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38577) Interval types are not truncated to the expected endField when creating a DataFrame via Duration
[ https://issues.apache.org/jira/browse/SPARK-38577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38577: -- Description: *Problem:* ANSI interval types are stored as long internally. The long value is not truncated to the expected endField when creating a DataFrame via Duration. *Reproduce:* Create a "day to day" interval; the seconds are not truncated, see the code below. The internal long is not 86400 * 1000000, but (86400 + 1) * 1000000. {code:java} test("my test") { val data = Seq(Row(Duration.ofDays(1).plusSeconds(1))) val schema = StructType(Array( StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)) )) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.show() } {code} After debugging, the {{endField}} is always {{SECOND}} in {{durationToMicros}}, see below: {code:java} // IntervalUtils class def durationToMicros(duration: Duration): Long = { durationToMicros(duration, DT.SECOND) // always SECOND } def durationToMicros(duration: Duration, endField: Byte) {code} It seems a different endField should be used, which could be [DAY, HOUR, MINUTE, SECOND]. Or Spark can throw an exception to avoid truncating. was: *Problem:* ANSI interval types are stored as long internally. The long value is not truncated to the expected endField when creating a DataFrame via Duration. *Reproduce:* Create a "day to day" interval; the seconds are not truncated, see the code below. The internal long is not 86400 * 1000000, but (86400 + 1) * 1000000. {code:java} test("my test") { val data = Seq(Row(Duration.ofDays(1).plusSeconds(1))) val schema = StructType(Array( StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)) )) val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema) df.show() } {code} After debugging, the {{endField}} is always {{SECOND}} in {{durationToMicros}}, see below: {code:java} // IntervalUtils class def durationToMicros(duration: Duration): Long = { durationToMicros(duration, DT.SECOND) // always SECOND } def durationToMicros(duration: Duration, endField: Byte) {code} It seems a different endField should be used, which could be [DAY, HOUR, MINUTE, SECOND]. > Interval types are not truncated to the expected endField when creating a > DataFrame via Duration > > > Key: SPARK-38577 > URL: https://issues.apache.org/jira/browse/SPARK-38577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 > Environment: Spark 3.3.0 snapshot version > >Reporter: chong >Priority: Major > > *Problem:* > ANSI interval types are stored as long internally. > The long value is not truncated to the expected endField when creating a > DataFrame via Duration. > > *Reproduce:* > Create a "day to day" interval; the seconds are not truncated, see the code below.
> The internal long is not 86400 * 1000000, but (86400 + 1) * 1000000. > > {code:java} > test("my test") { > val data = Seq(Row(Duration.ofDays(1).plusSeconds(1))) > val schema = StructType(Array( > StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, > DayTimeIntervalType.DAY)) > )) > val df = spark.createDataFrame(spark.sparkContext.parallelize(data), > schema) > df.show() > } {code} > > > After debugging, the {{endField}} is always {{SECOND}} in > {{durationToMicros}}, see below: > > {code:java} > // IntervalUtils class > def durationToMicros(duration: Duration): Long = { > durationToMicros(duration, DT.SECOND) // always SECOND > } > def durationToMicros(duration: Duration, endField: Byte) > {code} > It seems a different endField should be used, which could be [DAY, HOUR, MINUTE, SECOND]. > Or Spark can throw an exception to avoid truncating. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
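A sketch of the truncation the reporter suggests, assuming the DayTimeIntervalType field byte codes (DAY=0, HOUR=1, MINUTE=2, SECOND=3); this is illustrative, not the actual IntervalUtils implementation:

{code:java}
// Truncate a microsecond count down to the interval's end field, so that a
// DAY-to-DAY interval built from "1 day + 1 second" stores 86400 * 1000000
// micros rather than 86401 * 1000000.
val MICROS_PER_SECOND = 1000000L
val MICROS_PER_MINUTE = 60L * MICROS_PER_SECOND
val MICROS_PER_HOUR   = 60L * MICROS_PER_MINUTE
val MICROS_PER_DAY    = 24L * MICROS_PER_HOUR

def truncateToEndField(micros: Long, endField: Byte): Long = {
  val unit = endField match {
    case 0 => MICROS_PER_DAY     // DayTimeIntervalType.DAY
    case 1 => MICROS_PER_HOUR    // DayTimeIntervalType.HOUR
    case 2 => MICROS_PER_MINUTE  // DayTimeIntervalType.MINUTE
    case _ => 1L                 // DayTimeIntervalType.SECOND keeps micros
  }
  micros / unit * unit
}
{code}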
[jira] [Updated] (SPARK-38520) Overflow occurs when reading ANSI day time interval from CSV file
[ https://issues.apache.org/jira/browse/SPARK-38520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38520: -- Description: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code:java} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. was: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. > Overflow occurs when reading ANSI day time interval from CSV file > - > > Key: SPARK-38520 > URL: https://issues.apache.org/jira/browse/SPARK-38520 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chong >Priority: Major > > *Problem:* > Overflow occurs when reading the following positive intervals; the results > become negative:
[jira] [Updated] (SPARK-38520) Overflow occurs when reading ANSI day time interval from CSV file
[ https://issues.apache.org/jira/browse/SPARK-38520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chong updated SPARK-38520: -- Description: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code:java} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. was: *Problem:* Overflow occurs when reading the following positive intervals; the results become negative: interval '106751992' day => INTERVAL '-106751990' DAY INTERVAL +'+2562047789' hour => INTERVAL '-2562047787' HOUR interval '153722867281' minute => INTERVAL '-153722867280' MINUTE *Reproduce:* {code:java} // days overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY)))) scala> spark.read.csv(path).show(false) +------------------------+ |_c0                     | +------------------------+ |interval '106751992' day| +------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------+ |c1                       | +-------------------------+ |INTERVAL '-106751990' DAY| +-------------------------+ // hour overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.HOUR, DayTimeIntervalType.HOUR)))) scala> spark.read.csv(path).show(false) +----------------------------+ |_c0                         | +----------------------------+ |INTERVAL +'+2562047789' hour| +----------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +---------------------------+ |c1                         | +---------------------------+ |INTERVAL '-2562047787' HOUR| +---------------------------+ // minute overflow scala> val schema = StructType(Seq(StructField("c1", DayTimeIntervalType(DayTimeIntervalType.MINUTE, DayTimeIntervalType.MINUTE)))) scala> spark.read.csv(path).show(false) +------------------------------+ |_c0                           | +------------------------------+ |interval '153722867281' minute| +------------------------------+ scala> spark.read.schema(schema).csv(path).show(false) +-------------------------------+ |c1                             | +-------------------------------+ |INTERVAL '-153722867280' MINUTE| +-------------------------------+ {code} *Others:* Also check whether a negative value is read as positive. > Overflow occurs when reading ANSI day time interval from CSV file > - > > Key: SPARK-38520 > URL: https://issues.apache.org/jira/browse/SPARK-38520 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: chong >Priority: Major > > *Problem:* > Overflow occurs when reading the following positive intervals, the results > become negative:
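The wrap-around is plain Long overflow in the microsecond representation used for day-time intervals; a small sketch of the arithmetic, using the values from the report:

{code:java}
// Day-time intervals are stored as microseconds in a Long, which caps out at
// 106,751,991 whole days; one more day silently wraps to a negative value.
val microsPerDay = 24L * 60 * 60 * 1000 * 1000
println(Long.MaxValue / microsPerDay)  // 106751991
println(106751992L * microsPerDay)     // wraps to a large negative Long
{code}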
[jira] [Assigned] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38576: Assignee: (was: Apache Spark) > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38576: Assignee: Apache Spark > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507931#comment-17507931 ] Apache Spark commented on SPARK-38576: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35868 > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
[ https://issues.apache.org/jira/browse/SPARK-38576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507930#comment-17507930 ] Apache Spark commented on SPARK-38576: -- User 'xinrong-databricks' has created a pull request for this issue: https://github.com/apache/spark/pull/35868 > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only > --- > > Key: SPARK-38576 > URL: https://issues.apache.org/jira/browse/SPARK-38576 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank > numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38577) Interval types are not truncated to the expected endField when creating a DataFrame via Duration
chong created SPARK-38577: - Summary: Interval types are not truncated to the expected endField when creating a DataFrame via Duration Key: SPARK-38577 URL: https://issues.apache.org/jira/browse/SPARK-38577 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Environment: Spark 3.3.0 snapshot version Reporter: chong *Problem:* ANSI interval types are stored as long internally. The long value is not truncated to the expected endField when creating a DataFrame via Duration. *Reproduce:* Create a "day to day" interval; the seconds are not truncated, see the code below. The internal long is not *86400 * 1000000* but *(86400 + 1) * 1000000*.
{code:java}
test("my test") {
  val data = Seq(Row(Duration.ofDays(1).plusSeconds(1)))
  val schema = StructType(Array(
    StructField("t", DayTimeIntervalType(DayTimeIntervalType.DAY, DayTimeIntervalType.DAY))
  ))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  df.show()
}
{code}
After debugging, the {{endField}} is always {{SECOND}} in {{durationToMicros}}, see below:
{code:java}
// IntervalUtils class
def durationToMicros(duration: Duration): Long = {
  durationToMicros(duration, DT.SECOND) // always SECOND
}

def durationToMicros(duration: Duration, endField: Byte)
{code}
It seems a different endField should be used, one of [DAY, HOUR, MINUTE, SECOND]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
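A hedged sketch of the truncation the report asks for: compute the microseconds, then drop everything below the requested end field. The field-to-unit mapping below is illustrative (it mirrors DayTimeIntervalType's DAY/HOUR/MINUTE/SECOND constants) and is not Spark's actual implementation:
{code:java}
import java.time.Duration

// Hypothetical truncation sketch, not Spark's code. Field bytes mirror
// DayTimeIntervalType: DAY = 0, HOUR = 1, MINUTE = 2, SECOND = 3.
object TruncateToEndField {
  val MicrosPerSecond = 1000000L
  val MicrosPerMinute = 60L * MicrosPerSecond
  val MicrosPerHour = 60L * MicrosPerMinute
  val MicrosPerDay = 24L * MicrosPerHour

  def durationToMicros(d: Duration, endField: Byte): Long = {
    val micros = Math.addExact(
      Math.multiplyExact(d.getSeconds, MicrosPerSecond), d.getNano / 1000L)
    val unit = endField match {
      case 0 => MicrosPerDay
      case 1 => MicrosPerHour
      case 2 => MicrosPerMinute
      case _ => MicrosPerSecond
    }
    micros - Math.floorMod(micros, unit) // zero out everything below endField
  }
}

// Duration.ofDays(1).plusSeconds(1) truncated to DAY (0) yields exactly 86400 * 1000000L.
{code}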
[jira] [Created] (SPARK-38576) Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only
Xinrong Meng created SPARK-38576: Summary: Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only Key: SPARK-38576 URL: https://issues.apache.org/jira/browse/SPARK-38576 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Implement `numeric_only` parameter for `DataFrame/Series.rank` to rank numeric columns only. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38556) Disable Pandas usage logging for method calls inside @contextmanager functions
[ https://issues.apache.org/jira/browse/SPARK-38556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38556. -- Target Version/s: 3.2.1, 3.3.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/35861 > Disable Pandas usage logging for method calls inside @contextmanager functions > -- > > Key: SPARK-38556 > URL: https://issues.apache.org/jira/browse/SPARK-38556 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.1 >Reporter: Yihong He >Priority: Minor > > Currently, calls inside @contextmanager functions are treated as external for > *with* statements. > For example, the below code records config.set_option calls inside > ps.option_context(...) > {code:java} > with ps.option_context("compute.ops_on_diff_frames", True): > pass {code} > We should disable usage logging for calls inside @contextmanager functions to > improve accuracy of the usage data > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38441. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35747 [https://github.com/apache/spark/pull/35747] > Support string and bool `regex` in `Series.replace` > --- > > Key: SPARK-38441 > URL: https://issues.apache.org/jira/browse/SPARK-38441 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > Support string and bool `regex` in `Series.replace` in order to reach parity > with pandas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38441) Support string and bool `regex` in `Series.replace`
[ https://issues.apache.org/jira/browse/SPARK-38441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38441: Assignee: Xinrong Meng > Support string and bool `regex` in `Series.replace` > --- > > Key: SPARK-38441 > URL: https://issues.apache.org/jira/browse/SPARK-38441 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Support string and bool `regex` in `Series.replace` in order to reach parity > with pandas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38572. --- Resolution: Fixed Issue resolved by pull request 35879 [https://github.com/apache/spark/pull/35879] > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507921#comment-17507921 ] Apache Spark commented on SPARK-38575: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/35882 > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently we have to make some changes every time we cut a branch, like > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38575: Assignee: (was: Apache Spark) > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently we have to make some changes every time we cut a branch, like > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
[ https://issues.apache.org/jira/browse/SPARK-38575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38575: Assignee: Apache Spark > Deduplicate branch specification in GitHub Actions workflow > --- > > Key: SPARK-38575 > URL: https://issues.apache.org/jira/browse/SPARK-38575 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently we have to make some changes every time we cut a branch, like > https://github.com/apache/spark/pull/35876. Ideally this should work > automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38575) Deduplicate branch specification in GitHub Actions workflow
Hyukjin Kwon created SPARK-38575: Summary: Deduplicate branch specification in GitHub Actions workflow Key: SPARK-38575 URL: https://issues.apache.org/jira/browse/SPARK-38575 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.4.0 Reporter: Hyukjin Kwon Currently we have to make some changes every time we cut a branch, like https://github.com/apache/spark/pull/35876. Ideally this should work automatically without such changes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38568: -- Affects Version/s: 3.4.0 (was: 3.3.0) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507907#comment-17507907 ] Dongjoon Hyun commented on SPARK-38568: --- Hi, [~yumwang]. When you clone an issue, you should change the `Reporter`. :) I didn't report this JIRA; the reporter should be you, since you created it. Could you fix it? > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources
[ https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507899#comment-17507899 ] Apache Spark commented on SPARK-36664: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/35881 > Log time spent waiting for cluster resources > > > Key: SPARK-36664 > URL: https://issues.apache.org/jira/browse/SPARK-36664 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Major > > To provide better visibility into why jobs might be running slowly, it would > be useful to log when we are waiting for cluster resources and for how long, > so that the user can be aware of any underlying cluster issue. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources
[ https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507900#comment-17507900 ] Apache Spark commented on SPARK-36664: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/35881 > Log time spent waiting for cluster resources > > > Key: SPARK-36664 > URL: https://issues.apache.org/jira/browse/SPARK-36664 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.3.0 >Reporter: Holden Karau >Priority: Major > > To provide better visibility into why jobs might be running slowly, it would > be useful to log when we are waiting for cluster resources and for how long, > so that the user can be aware of any underlying cluster issue. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38574) Enrich Avro data source documentation
[ https://issues.apache.org/jira/browse/SPARK-38574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38574: Assignee: Apache Spark > Enrich Avro data source documentation > - > > Key: SPARK-38574 > URL: https://issues.apache.org/jira/browse/SPARK-38574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Tianhan Hu >Assignee: Apache Spark >Priority: Minor > > Enrich Avro data source documentation to emphasize the difference between > *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter > of the function *from_avro*. > When using *from_avro*, the *avroSchema* option can be set to a compatible, > evolved schema, while *jsonFormatSchema* has to be the actual schema. > Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38574) Enrich Avro data source documentation
[ https://issues.apache.org/jira/browse/SPARK-38574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38574: Assignee: (was: Apache Spark) > Enrich Avro data source documentation > - > > Key: SPARK-38574 > URL: https://issues.apache.org/jira/browse/SPARK-38574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Tianhan Hu >Priority: Minor > > Enrich Avro data source documentation to emphasize the difference between > *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter > of the function *from_avro*. > When using *from_avro*, the *avroSchema* option can be set to a compatible, > evolved schema, while *jsonFormatSchema* has to be the actual schema. > Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38574) Enrich Avro data source documentation
[ https://issues.apache.org/jira/browse/SPARK-38574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507898#comment-17507898 ] Apache Spark commented on SPARK-38574: -- User 'tianhanhu' has created a pull request for this issue: https://github.com/apache/spark/pull/35880 > Enrich Avro data source documentation > - > > Key: SPARK-38574 > URL: https://issues.apache.org/jira/browse/SPARK-38574 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Tianhan Hu >Priority: Minor > > Enrich Avro data source documentation to emphasize the difference between > *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter > of the function *from_avro*. > When using *from_avro*, the *avroSchema* option can be set to a compatible, > evolved schema, while *jsonFormatSchema* has to be the actual schema. > Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38574) Enrich Avro data source documentation
Tianhan Hu created SPARK-38574: -- Summary: Enrich Avro data source documentation Key: SPARK-38574 URL: https://issues.apache.org/jira/browse/SPARK-38574 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.1 Reporter: Tianhan Hu Enrich Avro data source documentation to emphasize the difference between *avroSchema*, which is an option, and *jsonFormatSchema*, which is a parameter of the function *from_avro*. When using *from_avro*, the *avroSchema* option can be set to a compatible, evolved schema, while *jsonFormatSchema* has to be the actual schema. Otherwise, the behavior is undefined. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
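To make the distinction concrete, a hedged usage sketch: the schemas are made-up placeholders, and df is assumed to be a DataFrame with a binary Avro column named value.
{code:java}
import java.util.Collections
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

// Hypothetical writer (actual) and evolved reader schemas, for illustration.
val actualSchema =
  """{"type":"record","name":"r","fields":[{"name":"id","type":"long"}]}"""
val evolvedSchema =
  """{"type":"record","name":"r","fields":[
    |  {"name":"id","type":"long"},
    |  {"name":"tag","type":["null","string"],"default":null}]}""".stripMargin

val decoded = df.select(
  from_avro(
    col("value"),
    actualSchema,                                         // must be the actual schema
    Collections.singletonMap("avroSchema", evolvedSchema) // evolved, compatible schema
  ).as("event"))
{code}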
[jira] [Resolved] (SPARK-38555) Avoid contention and get or create clientPools quickly in the TransportClientFactory
[ https://issues.apache.org/jira/browse/SPARK-38555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-38555. - Assignee: weixiuli Resolution: Fixed > Avoid contention and get or create clientPools quickly in the > TransportClientFactory > - > > Key: SPARK-38555 > URL: https://issues.apache.org/jira/browse/SPARK-38555 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Assignee: weixiuli >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38555) Avoid contention and get or create clientPools quickly in the TransportClientFactory
[ https://issues.apache.org/jira/browse/SPARK-38555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-38555: Fix Version/s: 3.4.0 > Avoid contention and get or create clientPools quickly in the > TransportClientFactory > - > > Key: SPARK-38555 > URL: https://issues.apache.org/jira/browse/SPARK-38555 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1 >Reporter: weixiuli >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38573) Support Partition Level Statistics Collection
Kazuyuki Tanimura created SPARK-38573: - Summary: Support Partition Level Statistics Collection Key: SPARK-38573 URL: https://issues.apache.org/jira/browse/SPARK-38573 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Kazuyuki Tanimura Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing the aggregated stats at table level for partitioned tables with the config spark.sql.statistics.size.autoUpdate.enabled. Supporting partition-level stats is useful for knowing which partitions are outliers (skewed partitions), and the query optimizer works better with partition-level stats in the case of partition pruning. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
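As a point of reference, partition-level stats can already be collected explicitly today; the ticket is about maintaining them automatically. A sketch with illustrative table and partition names, assuming an active SparkSession named spark:
{code:java}
// Explicit, per-partition statistics collection as it works today (sketch).
spark.sql("SET spark.sql.statistics.size.autoUpdate.enabled=true")
spark.sql("ANALYZE TABLE sales PARTITION (ds='2022-03-16') COMPUTE STATISTICS")
// Partition-level numFiles / totalSize / rowCount appear in the detailed output.
spark.sql("DESCRIBE EXTENDED sales PARTITION (ds='2022-03-16')").show(false)
{code}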
[jira] [Resolved] (SPARK-38545) Upgrade scala-maven-plugin from 4.4.0 to 4.5.6
[ https://issues.apache.org/jira/browse/SPARK-38545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38545. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35841 [https://github.com/apache/spark/pull/35841] > Upgrade scala-maven-plugin from 4.4.0 to 4.5.6 > -- > > Key: SPARK-38545 > URL: https://issues.apache.org/jira/browse/SPARK-38545 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > > > 4.5.6 upgrades the zinc dependency to 1.5.8 and cleans up some unnecessary > cascading dependencies > > https://github.com/davidB/scala-maven-plugin/compare/4.4.0...4.5.6 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38545) Upgrade scala-maven-plugin from 4.4.0 to 4.5.6
[ https://issues.apache.org/jira/browse/SPARK-38545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38545: Assignee: Yang Jie > Upgrade scala-maven-plugin from 4.4.0 to 4.5.6 > -- > > Key: SPARK-38545 > URL: https://issues.apache.org/jira/browse/SPARK-38545 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > 4.5.6 upgrades the zinc dependency to 1.5.8 and cleans up some unnecessary > cascading dependencies > > https://github.com/davidB/scala-maven-plugin/compare/4.4.0...4.5.6 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507808#comment-17507808 ] Apache Spark commented on SPARK-38572: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35879 > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507807#comment-17507807 ] Apache Spark commented on SPARK-38572: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35879 > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38572: Assignee: Max Gekk (was: Apache Spark) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38572: Assignee: Apache Spark (was: Max Gekk) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38561) Add doc for "Customized Kubernetes Schedulers"
[ https://issues.apache.org/jira/browse/SPARK-38561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-38561. -- Fix Version/s: 3.3.0 Assignee: Yikun Jiang Resolution: Fixed > Add doc for "Customized Kubernetes Schedulers" > -- > > Key: SPARK-38561 > URL: https://issues.apache.org/jira/browse/SPARK-38561 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38503) Add warning for getAdditionalPreKubernetesResources on executor side
[ https://issues.apache.org/jira/browse/SPARK-38503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-38503: - Target Version/s: 3.4.0 > Add warning for getAdditionalPreKubernetesResources on executor side > - > > Key: SPARK-38503 > URL: https://issues.apache.org/jira/browse/SPARK-38503 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.3.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507780#comment-17507780 ] Max Gekk commented on SPARK-38572: -- FYI, I am working on this. > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38572: - Reporter: Max Gekk (was: Dongjoon Hyun) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38572: - Fix Version/s: 3.4.0 (was: 3.3.0) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-38572: - Affects Version/s: 3.4.0 (was: 3.3.0) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-38572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38572: Assignee: Max Gekk (was: Dongjoon Hyun) > Setting version to 3.4.0-SNAPSHOT > - > > Key: SPARK-38572 > URL: https://issues.apache.org/jira/browse/SPARK-38572 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38572) Setting version to 3.4.0-SNAPSHOT
Max Gekk created SPARK-38572: Summary: Setting version to 3.4.0-SNAPSHOT Key: SPARK-38572 URL: https://issues.apache.org/jira/browse/SPARK-38572 Project: Spark Issue Type: Task Components: Build Affects Versions: 3.3.0 Reporter: Dongjoon Hyun Assignee: Dongjoon Hyun Fix For: 3.3.0 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.3.0 (was: 3.4.0) > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-38194: -- Fix Version/s: 3.4.0 (was: 3.3.0) > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.4.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38194) Make Yarn memory overhead factor configurable
[ https://issues.apache.org/jira/browse/SPARK-38194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-38194. --- Fix Version/s: 3.3.0 Assignee: Adam Binford Resolution: Fixed > Make Yarn memory overhead factor configurable > - > > Key: SPARK-38194 > URL: https://issues.apache.org/jira/browse/SPARK-38194 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.2.1 >Reporter: Adam Binford >Assignee: Adam Binford >Priority: Major > Fix For: 3.3.0 > > > Currently if the memory overhead is not provided for a Yarn job, it defaults > to 10% of the respective driver/executor memory. This 10% is hard-coded and > the only way to increase memory overhead is to set the exact memory overhead. > We have seen more than 10% memory being used, and it would be helpful to be > able to set the default overhead factor so that the overhead doesn't need to > be pre-calculated for any driver/executor memory size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
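For illustration, a sketch of the container sizing this change makes tunable. The max(memory * factor, 384 MiB) shape and the 0.10 default come from the description above; the config name follows the linked change and should be verified against the released docs:
{code:java}
// Sketch of Yarn container sizing, not Spark's code: overhead defaults to
// max(memory * factor, 384 MiB), with the factor hard-coded to 0.10 before
// this change.
val executorMemoryMiB = 8192L
val overheadFactor = 0.18 // e.g. when more than 10% off-heap usage is observed
val overheadMiB = math.max((executorMemoryMiB * overheadFactor).toLong, 384L)
println(s"container request: ${executorMemoryMiB + overheadMiB} MiB")
// Assumed config name from the linked change:
// spark-submit ... --conf spark.executor.memoryOverheadFactor=0.18
{code}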
[jira] [Updated] (SPARK-38571) Week of month from a date is missing in spark3 for return values of 1 to 6
[ https://issues.apache.org/jira/browse/SPARK-38571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Appel updated SPARK-38571: Description: In Spark2 we could use the date_format function with either the W or F flags to compute the week of month from a date. These compute two different things: W has values from 1 to 6 and F has values from 1 to 5. Sample code and expected output:
``` python
df1 = spark.createDataFrame(
    [
        (1, date(2014, 3, 7)),
        (2, date(2014, 3, 8)),
        (3, date(2014, 3, 30)),
        (4, date(2014, 3, 31)),
        (5, date(2015, 3, 7)),
        (6, date(2015, 3, 8)),
        (7, date(2015, 3, 30)),
        (8, date(2015, 3, 31)),
    ],
    schema="a long, b date",
)
df1 = df1.withColumn("WEEKOFMONTH1-6", F.date_format(F.col("b"), "W"))
df1 = df1.withColumn("WEEKOFMONTH1-5", F.date_format(F.col("b"), "F"))
df1.show()
```
+---+----------+--------------+--------------+
|  a|         b|WEEKOFMONTH1-6|WEEKOFMONTH1-5|
+---+----------+--------------+--------------+
|  1|2014-03-07|             2|             1|
|  2|2014-03-08|             2|             2|
|  3|2014-03-30|             6|             5|
|  4|2014-03-31|             6|             5|
|  5|2015-03-07|             1|             1|
|  6|2015-03-08|             2|             2|
|  7|2015-03-30|             5|             5|
|  8|2015-03-31|             5|             5|
+---+----------+--------------+--------------+
With Spark3 having spark.sql.legacy.timeParserPolicy set to EXCEPTION by default, this throws an error:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: W, Please use the SQL function EXTRACT instead
However, the EXTRACT function has nothing available that extracts the week of month for the values 1 to 6. The Spark3 docs mention we can define our own patterns, located at [https://spark.apache.org/docs/3.2.1/sql-ref-datetime-pattern.html], which are implemented via DateTimeFormatter under the hood: [https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] That site lists both W and F for week of month:
W week-of-month number 4
F week-of-month number 3
However, only F is implemented on the datetime pattern reference. Is there another way we can compute this week of month for values 1 to 6 while still using the built-ins with Spark3? Currently we have to set spark.sql.legacy.timeParserPolicy to LEGACY in order to run this. Thank you, Kevin
[jira] [Created] (SPARK-38571) Week of month from a date is missing in spark3 for return values of 1 to 6
Kevin Appel created SPARK-38571: --- Summary: Week of month from a date is missing in spark3 for return values of 1 to 6 Key: SPARK-38571 URL: https://issues.apache.org/jira/browse/SPARK-38571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: Kevin Appel In Spark2 we could use the date_format function with either the W or F flags to compute the week of month from a date. These compute two different things: W has values from 1 to 6 and F has values from 1 to 5. Sample code and expected output:
``` python
df1 = spark.createDataFrame(
    [
        (1, date(2014, 3, 7)),
        (2, date(2014, 3, 8)),
        (3, date(2014, 3, 30)),
        (4, date(2014, 3, 31)),
        (5, date(2015, 3, 7)),
        (6, date(2015, 3, 8)),
        (7, date(2015, 3, 30)),
        (8, date(2015, 3, 31)),
    ],
    schema="a long, b date",
)
df1 = df1.withColumn("WEEKOFMONTH1-6", F.date_format(F.col("b"), "W"))
df1 = df1.withColumn("WEEKOFMONTH1-5", F.date_format(F.col("b"), "F"))
df1.show()
```
+---+----------+--------------+--------------+
|  a|         b|WEEKOFMONTH1-6|WEEKOFMONTH1-5|
+---+----------+--------------+--------------+
|  1|2014-03-07|             2|             1|
|  2|2014-03-08|             2|             2|
|  3|2014-03-30|             6|             5|
|  4|2014-03-31|             6|             5|
|  5|2015-03-07|             1|             1|
|  6|2015-03-08|             2|             2|
|  7|2015-03-30|             5|             5|
|  8|2015-03-31|             5|             5|
+---+----------+--------------+--------------+
With Spark3 having spark.sql.legacy.timeParserPolicy set to EXCEPTION by default, this throws an error:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: W, Please use the SQL function EXTRACT instead
However, the EXTRACT function has nothing available that extracts the week of month for the values 1 to 6. The Spark3 docs mention we can define our own patterns, located at [https://spark.apache.org/docs/3.2.1/sql-ref-datetime-pattern.html], which are implemented via DateTimeFormatter under the hood: [https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] That site lists both W and F for week of month:
W week-of-month number 4
F week-of-month number 3
However, only F is implemented on the datetime pattern reference. Is there another way we can compute this week of month for values 1 to 6 while still using the built-ins with Spark3? Currently we have to set spark.sql.legacy.timeParserPolicy to LEGACY in order to run this. Thank you, Kevin -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
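One possible non-legacy workaround, sketched in Scala from the example above: derive the 1-to-6 value from the day of month plus the weekday offset of the month's first day. The column name b comes from the reproduction; Sunday-start weeks are assumed from the sample output, so verify the locale semantics before relying on it:
{code:java}
import org.apache.spark.sql.functions._

// Weekday of the first day of the month; dayofweek returns 1 = Sunday ... 7 = Saturday.
val firstDowOffset = dayofweek(trunc(col("b"), "MM")) - lit(1)

// Week of month in 1..6. On the sample data this reproduces the W column:
// 2014-03-07 -> 2, 2014-03-30 -> 6, 2015-03-07 -> 1, 2015-03-30 -> 5.
val weekOfMonth1to6 =
  floor((dayofmonth(col("b")) + firstDowOffset - lit(1)) / 7) + lit(1)

// Usage: df1.withColumn("WEEKOFMONTH1-6", weekOfMonth1to6)
{code}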
[jira] [Commented] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507626#comment-17507626 ] Apache Spark commented on SPARK-38570: -- User 'mcdull-zhang' has created a pull request for this issue: https://github.com/apache/spark/pull/35878 > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38570: Assignee: Apache Spark > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Assignee: Apache Spark >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38570: Assignee: (was: Apache Spark) > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
[ https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mcdull_zhang updated SPARK-38570: - Description: The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} > Incorrect DynamicPartitionPruning caused by Literal > --- > > Key: SPARK-38570 > URL: https://issues.apache.org/jira/browse/SPARK-38570 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: mcdull_zhang >Priority: Minor > > The return value of Literal.references is an empty AttributeSet, so Literal > is mistaken for a partition column. > > org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: > {code:java} > val srcInfo: Option[(Expression, LogicalPlan)] = > findExpressionAndTrackLineageDown(a, plan) > srcInfo.flatMap { > case (resExp, l: LogicalRelation) => > l.relation match { > case fs: HadoopFsRelation => > val partitionColumns = AttributeSet( > l.resolve(fs.partitionSchema, > fs.sparkSession.sessionState.analyzer.resolver)) > // When resExp is a Literal, Literal is considered a partition > column. > if (resExp.references.subsetOf(partitionColumns)) { > return Some(l) > } else { > None > } > case _ => None > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal
mcdull_zhang created SPARK-38570: Summary: Incorrect DynamicPartitionPruning caused by Literal Key: SPARK-38570 URL: https://issues.apache.org/jira/browse/SPARK-38570 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: mcdull_zhang The return value of Literal.references is an empty AttributeSet, so Literal is mistaken for a partition column. org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan: {code:java} val srcInfo: Option[(Expression, LogicalPlan)] = findExpressionAndTrackLineageDown(a, plan) srcInfo.flatMap { case (resExp, l: LogicalRelation) => l.relation match { case fs: HadoopFsRelation => val partitionColumns = AttributeSet( l.resolve(fs.partitionSchema, fs.sparkSession.sessionState.analyzer.resolver)) // When resExp is a Literal, Literal is considered a partition column. if (resExp.references.subsetOf(partitionColumns)) { return Some(l) } else { None } case _ => None } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
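The pitfall is easy to see with plain Scala sets: an empty set is a subset of every set, so an expression with no references (a Literal) passes the subsetOf test. A hedged sketch of the extra guard, illustrative only and not the merged fix:
{code:java}
// The bug in miniature with plain Scala sets: an empty reference set is a
// subset of any attribute set, so a Literal slips through the partition test.
val partitionColumns = Set("ds")
val literalRefs = Set.empty[String] // Literal.references is empty

println(literalRefs.subsetOf(partitionColumns)) // true -> wrongly accepted

// Requiring at least one reference closes the hole (sketch only):
def isPartitionFilter(refs: Set[String]): Boolean =
  refs.nonEmpty && refs.subsetOf(partitionColumns)

println(isPartitionFilter(literalRefs)) // false
println(isPartitionFilter(Set("ds")))   // true
{code}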
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507585#comment-17507585 ] Apache Spark commented on SPARK-38568: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35877 > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38568: Assignee: Apache Spark > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38568: Assignee: (was: Apache Spark) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507584#comment-17507584 ] Apache Spark commented on SPARK-38568: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/35877 > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38569) external top-level directory is problematic for bazel
[ https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507583#comment-17507583 ] Apache Spark commented on SPARK-38569: -- User 'alkis' has created a pull request for this issue: https://github.com/apache/spark/pull/35874 > external top-level directory is problematic for bazel > - > > Key: SPARK-38569 > URL: https://issues.apache.org/jira/browse/SPARK-38569 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: Alkis Evlogimenos >Priority: Minor > Labels: build > > {{external}} is a hardwired special name for top-level directories for > [bazel|https://bazel.build/]. This causes all sorts of issues with both > native/basic bazel and extensions like > [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. > Spark forks using bazel to build Spark have to jump through hoops to make > things work, if at all. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38569) external top-level directory is problematic for bazel
[ https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38569: Assignee: Apache Spark > external top-level directory is problematic for bazel > - > > Key: SPARK-38569 > URL: https://issues.apache.org/jira/browse/SPARK-38569 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: Alkis Evlogimenos >Assignee: Apache Spark >Priority: Minor > Labels: build > > {{external}} is a hardwired special name for top-level directories in > [bazel|https://bazel.build/]. This causes all sorts of issues with both > native/basic bazel and extensions like > [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. > Spark forks that use bazel to build Spark have to jump through hoops to make > things work, if they work at all.
[jira] [Assigned] (SPARK-38569) external top-level directory is problematic for bazel
[ https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38569: Assignee: (was: Apache Spark) > external top-level directory is problematic for bazel > - > > Key: SPARK-38569 > URL: https://issues.apache.org/jira/browse/SPARK-38569 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.1 >Reporter: Alkis Evlogimenos >Priority: Minor > Labels: build > > {{external}} is a hardwired special name for top-level directories in > [bazel|https://bazel.build/]. This causes all sorts of issues with both > native/basic bazel and extensions like > [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. > Spark forks that use bazel to build Spark have to jump through hoops to make > things work, if they work at all.
[jira] [Created] (SPARK-38569) external top-level directory is problematic for bazel
Alkis Evlogimenos created SPARK-38569: - Summary: external top-level directory is problematic for bazel Key: SPARK-38569 URL: https://issues.apache.org/jira/browse/SPARK-38569 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.2.1 Reporter: Alkis Evlogimenos {{external}} is a hardwired special name for top-level directories in [bazel|https://bazel.build/]. This causes all sorts of issues with both native/basic bazel and extensions like [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor]. Spark forks that use bazel to build Spark have to jump through hoops to make things work, if they work at all.
[jira] [Created] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
Yuming Wang created SPARK-38568: --- Summary: Upgrade ZSTD-JNI to 1.5.2-2 Key: SPARK-38568 URL: https://issues.apache.org/jira/browse/SPARK-38568 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.3.0 Reporter: Dongjoon Hyun Assignee: Dongjoon Hyun Fix For: 3.3.0
[jira] [Updated] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-38568: Fix Version/s: (was: 3.3.0) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-38568) Upgrade ZSTD-JNI to 1.5.2-2
[ https://issues.apache.org/jira/browse/SPARK-38568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-38568: --- Assignee: (was: Dongjoon Hyun) > Upgrade ZSTD-JNI to 1.5.2-2 > --- > > Key: SPARK-38568 > URL: https://issues.apache.org/jira/browse/SPARK-38568 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-38330) Certificate doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
[ https://issues.apache.org/jira/browse/SPARK-38330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507557#comment-17507557 ] Steve Loughran commented on SPARK-38330: Sorry about that. Try enabling path-style access and see if that helps. > Certificate doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > -- > > Key: SPARK-38330 > URL: https://issues.apache.org/jira/browse/SPARK-38330 > Project: Spark > Issue Type: Bug > Components: EC2 >Affects Versions: 3.2.1 > Environment: Spark 3.2.1 built with `hadoop-cloud` flag. > Direct access to s3 using default file committer. > JDK8. > >Reporter: André F. >Priority: Major > > Trying to run any job after bumping our Spark version from 3.1.2 to 3.2.1 > led us to the following exception when reading files on s3: > {code:java} > org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on > s3a:///.parquet: com.amazonaws.SdkClientException: Unable to > execute HTTP request: Certificate for doesn't match > any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: > Unable to execute HTTP request: Certificate for doesn't match any of > the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com] at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208) at > org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170) at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277) > at > org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:596) {code} > > {code:java} > Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for > doesn't match any of the subject alternative names: > [*.s3.amazonaws.com, s3.amazonaws.com] > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:507) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:437) > at > com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384) > at > com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) > at > com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) > at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) > at com.amazonaws.http.conn.$Proxy16.connect(Unknown Source) > at > 
com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) > at > com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) > at > com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) > at > com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) > at > com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) > at > com.amazonaws.http.AmazonHttpClient$RequestExecutor.e
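For readers hitting the same SSLPeerUnverifiedException: the failure is typical of bucket names containing dots, where the virtual-hosted hostname (bucket.name.s3.amazonaws.com) no longer matches the wildcard certificate *.s3.amazonaws.com. A minimal sketch of the path-style workaround suggested above, using the Hadoop S3A setting fs.s3a.path.style.access (forwarded by Spark via the spark.hadoop. prefix); the bucket and path names are hypothetical:
{code:java}
// Sketch: with path-style access, S3A issues requests to
// s3.amazonaws.com/<bucket>/... instead of <bucket>.s3.amazonaws.com/...,
// sidestepping the subject-alternative-name mismatch.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-path-style-access")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()

// Hypothetical bucket with dots in its name, the usual trigger of the error.
val df = spark.read.parquet("s3a://my.dotted.bucket/data/table.parquet")
{code}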
[jira] [Assigned] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-38567: Assignee: Max Gekk > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Max Gekk >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Resolved] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-38567. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35876 [https://github.com/apache/spark/pull/35876] > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Max Gekk >Priority: Major > Fix For: 3.3.0 > > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Assigned] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38567: Assignee: Apache Spark > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Assigned] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38567: Assignee: (was: Apache Spark) > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Commented] (SPARK-38567) Enable GitHub Action build_and_test on branch-3.3
[ https://issues.apache.org/jira/browse/SPARK-38567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507532#comment-17507532 ] Apache Spark commented on SPARK-38567: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/35876 > Enable GitHub Action build_and_test on branch-3.3 > - > > Key: SPARK-38567 > URL: https://issues.apache.org/jira/browse/SPARK-38567 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-35995
[jira] [Commented] (SPARK-38108) Use error classes in the compilation errors of UDF/UDAF
[ https://issues.apache.org/jira/browse/SPARK-38108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507528#comment-17507528 ] huangtengfei commented on SPARK-38108: -- I am working on this. Thanks [~maxgekk] > Use error classes in the compilation errors of UDF/UDAF > --- > > Key: SPARK-38108 > URL: https://issues.apache.org/jira/browse/SPARK-38108 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > > Migrate the following errors in QueryCompilationErrors: > * noHandlerForUDAFError > * unexpectedEvalTypesForUDFsError > * usingUntypedScalaUDFError > * udfClassDoesNotImplementAnyUDFInterfaceError > * udfClassNotAllowedToImplementMultiUDFInterfacesError > * udfClassWithTooManyTypeArgumentsError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryCompilationErrorsSuite.
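As a rough sketch of what one such migration could look like (the error class name, message text, and constructor shape below are illustrative assumptions, not the final PR): the helper in QueryCompilationErrors stops building an ad-hoc message and instead throws an AnalysisException tied to an entry in error-classes.json, which the QueryCompilationErrorsSuite test can then assert on by error class rather than by message text:
{code:java}
import org.apache.spark.sql.AnalysisException

// Hypothetical error-classes.json entry:
//   "UNTYPED_SCALA_UDF" : {
//     "message" : [ "You're using untyped Scala UDF, which does not have the input type information." ]
//   }

// Sketch of the migrated helper: the thrown AnalysisException implements
// SparkThrowable and carries the error class instead of a free-form message.
def usingUntypedScalaUDFError(): Throwable = {
  new AnalysisException(
    errorClass = "UNTYPED_SCALA_UDF",
    messageParameters = Array.empty)
}
{code}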