[jira] [Assigned] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38651:


Assignee: Apache Spark

> Writing out empty or nested empty schemas in Datasource should be configurable
> --
>
> Key: SPARK-38651
> URL: https://issues.apache.org/jira/browse/SPARK-38651
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thejdeep Gudivada
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-23372, we introduced a backwards incompatible change that would 
> remove support for writing out empty or nested empty schemas in file based 
> datasources. This introduces backward incompatibility for users who have been 
> using a schema that met the above condition since the datasource supported 
> it. Except for Parquet and text, other file based sources support this 
> behavior.
>  
> We should either :
>  * Make it configurable to enable/disable writing out empty schemas
>  * Enable the validation check only for sources that do not support it - 
> Parquet / Text






[jira] [Commented] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512202#comment-17512202
 ] 

Apache Spark commented on SPARK-38651:
--

User 'thejdeep' has created a pull request for this issue:
https://github.com/apache/spark/pull/35969

> Writing out empty or nested empty schemas in Datasource should be configurable
> --
>
> Key: SPARK-38651
> URL: https://issues.apache.org/jira/browse/SPARK-38651
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> In SPARK-23372, we introduced a backwards incompatible change that would 
> remove support for writing out empty or nested empty schemas in file based 
> datasources. This introduces backward incompatibility for users who have been 
> using a schema that met the above condition since the datasource supported 
> it. Except for Parquet and text, other file based sources support this 
> behavior.
>  
> We should either :
>  * Make it configurable to enable/disable writing out empty schemas
>  * Enable the validation check only for sources that do not support it - 
> Parquet / Text






[jira] [Assigned] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38651:


Assignee: (was: Apache Spark)

> Writing out empty or nested empty schemas in Datasource should be configurable
> --
>
> Key: SPARK-38651
> URL: https://issues.apache.org/jira/browse/SPARK-38651
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> In SPARK-23372, we introduced a backwards incompatible change that would 
> remove support for writing out empty or nested empty schemas in file based 
> datasources. This introduces backward incompatibility for users who have been 
> using a schema that met the above condition since the datasource supported 
> it. Except for Parquet and text, other file based sources support this 
> behavior.
>  
> We should either :
>  * Make it configurable to enable/disable writing out empty schemas
>  * Enable the validation check only for sources that do not support it - 
> Parquet / Text






[jira] [Commented] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512198#comment-17512198
 ] 

Apache Spark commented on SPARK-38654:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35968

> Show default index type in SQL plans for pandas API on Spark
> 
>
> Key: SPARK-38654
> URL: https://issues.apache.org/jira/browse/SPARK-38654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, it's difficult for users to tell which plan and expressions are 
> for default index from explain API.
> We should mark and show which plan/expression is for the default index in 
> pandas API on Spark.






[jira] [Commented] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512197#comment-17512197
 ] 

Apache Spark commented on SPARK-38654:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35968

> Show default index type in SQL plans for pandas API on Spark
> 
>
> Key: SPARK-38654
> URL: https://issues.apache.org/jira/browse/SPARK-38654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, it's difficult for users to tell which plan and expressions are 
> for default index from explain API.
> We should mark and show which plan/expression is for the default index in 
> pandas API on Spark.






[jira] [Assigned] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38654:


Assignee: Apache Spark

> Show default index type in SQL plans for pandas API on Spark
> 
>
> Key: SPARK-38654
> URL: https://issues.apache.org/jira/browse/SPARK-38654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, it's difficult for users to tell which plan and expressions are 
> for default index from explain API.
> We should mark and show which plan/expression is for the default index in 
> pandas API on Spark.






[jira] [Assigned] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38654:


Assignee: (was: Apache Spark)

> Show default index type in SQL plans for pandas API on Spark
> 
>
> Key: SPARK-38654
> URL: https://issues.apache.org/jira/browse/SPARK-38654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, it's difficult for users to tell which plan and expressions are 
> for default index from explain API.
> We should mark and show which plan/expression is for the default index in 
> pandas API on Spark.






[jira] [Updated] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38654:
-
Priority: Minor  (was: Major)

> Show default index type in SQL plans for pandas API on Spark
> 
>
> Key: SPARK-38654
> URL: https://issues.apache.org/jira/browse/SPARK-38654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, it's difficult for users to tell which plan and expressions are 
> for default index from explain API.
> We should mark and show which plan/expression is for the default index in 
> pandas API on Spark.






[jira] [Created] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-24 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38654:


 Summary: Show default index type in SQL plans for pandas API on 
Spark
 Key: SPARK-38654
 URL: https://issues.apache.org/jira/browse/SPARK-38654
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Currently, it's difficult for users to tell from the explain API which plans 
and expressions implement the default index.

We should mark and show which plan/expression is for the default index in 
pandas API on Spark.






[jira] [Commented] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512195#comment-17512195
 ] 

Apache Spark commented on SPARK-38570:
--

User 'mcdull-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35967

> Incorrect DynamicPartitionPruning caused by Literal
> ---
>
> Key: SPARK-38570
> URL: https://issues.apache.org/jira/browse/SPARK-38570
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: mcdull_zhang
>Priority: Minor
> Fix For: 3.3.0
>
>
> The return value of Literal.references is an empty AttributeSet, so Literal 
> is mistaken for a partition column.
>  
> org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
> {code:java}
> val srcInfo: Option[(Expression, LogicalPlan)] = 
> findExpressionAndTrackLineageDown(a, plan)
> srcInfo.flatMap {
>   case (resExp, l: LogicalRelation) =>
> l.relation match {
>   case fs: HadoopFsRelation =>
> val partitionColumns = AttributeSet(
>   l.resolve(fs.partitionSchema, 
> fs.sparkSession.sessionState.analyzer.resolver))
> // When resExp is a Literal, Literal is considered a partition 
> column.         
> if (resExp.references.subsetOf(partitionColumns)) {
>   return Some(l)
> } else {
>   None
> }
>   case _ => None
> } {code}
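
A minimal illustration of the guard this describes (a sketch only, not
necessarily what the linked pull request does): because `Literal.references`
is empty, the empty-set `subsetOf` check passes trivially, so requiring a
non-empty reference set keeps a literal from being treated as a partition
column.
{code:java}
// Sketch only: names mirror PartitionPruning#getFilterableTableScan above.
// The extra resExp.references.nonEmpty guard is an assumed fix direction,
// not a quote of the actual patch.
if (resExp.references.nonEmpty &&
    resExp.references.subsetOf(partitionColumns)) {
  return Some(l)
} else {
  None
}
{code}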






[jira] [Resolved] (SPARK-38610) UI for Pandas API on Spark

2022-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38610.
--
Resolution: Invalid

Will focus on improving existing SQL UI to show the related info.

> UI for Pandas API on Spark
> --
>
> Key: SPARK-38610
> URL: https://issues.apache.org/jira/browse/SPARK-38610
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Web UI
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> Currently Pandas API on Spark does not have its dedicated UI which mixes up 
> with SQL UI tab. It should be better to have a dedicated page






[jira] [Assigned] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-38646:


Assignee: Zhen Li

> Pull a trait out for Python functions
> -
>
> Key: SPARK-38646
> URL: https://issues.apache.org/jira/browse/SPARK-38646
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently PySpark uses the case class PythonFunction in PythonRDD and many 
> other interfaces/classes. Propose changing it to a trait instead, to avoid 
> tying the implementation to the APIs.
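
To make the proposal concrete, here is a rough sketch of the shape such a
trait could take (illustrative only; the field names and the implementing
class name are assumptions, and the real change is whatever pull request
35964 merged):
{code:java}
// Hypothetical sketch: a minimal trait that PythonRDD and friends could
// depend on instead of a concrete case class.
trait PythonFunction {
  def command: Seq[Byte]
  def envVars: java.util.Map[String, String]
  def pythonExec: String
  def pythonVer: String
}

// An existing case class would then simply implement the trait, so callers
// are typed against the abstraction rather than one implementation.
case class SimplePythonFunction(
    command: Seq[Byte],
    envVars: java.util.Map[String, String],
    pythonExec: String,
    pythonVer: String) extends PythonFunction
{code}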






[jira] [Resolved] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38646.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35964
[https://github.com/apache/spark/pull/35964]

> Pull a trait out for Python functions
> -
>
> Key: SPARK-38646
> URL: https://issues.apache.org/jira/browse/SPARK-38646
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Zhen Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently PySpark uses the case class PythonFunction in PythonRDD and many 
> other interfaces/classes. Propose changing it to a trait instead, to avoid 
> tying the implementation to the APIs.






[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)

2022-03-24 Thread John Engelhart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Engelhart updated SPARK-38653:
---
Description: 
My understanding is that when you call .repartition(a column), the data for 
each unique key in that field goes to its own partition, so two keys should 
never be repartitioned into the same part file. That behavior holds with a 
String column, and it also holds with an Int column except for certain 
numbers; in my use case, the magic numbers are 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-0...").distinct.show
spark.read.parquet("path/part-1...").distinct.show

//Working as expected
spark.read.parquet("path1/part-0...").distinct.show
spark.read.parquet("path1/part-1...").distinct.show
spark.read.parquet("path1/part-2...").distinct.show {code}
!image-2022-03-24-22-16-44-560.png!

This problem really manifested itself when doing something like
{code:java}
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). 
repartition($"collectionIndex").write.mode("overwrite").partitionBy("collectionIndex").parquet("path")
 {code}
Because you end up with incorrect partitions where the data is commingled. 

  was:
My understanding is when you call .repartition(a column). For each unique key 
in that field. The data will go to that partition. There should never be two 
keys repartitioned to the same part. That behavior is true with a String 
column. That behavior is also true with an Int column except on certain 
numbers. In my use case. The magic numbers 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-0...").distinct.show
spark.read.parquet("path/part-1...").distinct.show

//Working as expected
spark.read.parquet("path1/part-0...").distinct.show
spark.read.parquet("path1/part-1...").distinct.show
spark.read.parquet("path1/part-2...").distinct.show {code}
!image-2022-03-24-22-09-26-917.png!


> Repartition by Column that is Int not working properly only on particular 
> numbers. (11, 33)
> ---
>
> Key: SPARK-38653
> URL: https://issues.apache.org/jira/browse/SPARK-38653
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an 
> EMR Notebook writing to S3
>Reporter: John Engelhart
>Priority: Major
>
> My understanding is when you call .repartition(a column). For each unique key 
> in that field. The data will go to that partition. There should never be two 
> keys repartitioned to the same part. That behavior is true with a String 
> column. That behavior is also true with an Int column except on certain 
> numbers. In my use case. The magic numbers 11 and 33.
> {code:java}
> //Int based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> //Produces two part files
> //String based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> //Produces three part files {code}
>  
> {code:java}
> //Not working as expected
> spark.read.parquet("path/part-0...").distinct.show
> spark.read.parquet("path/part-1...").distinct.show
> //Working as expected
> spark.read.parquet("path1/part-0...").distinct.show
> spark.read.parquet("path1/part-1...").distinct.show
> spark.read.parquet("path1/part-2...").distinct.show {code}
> !image-2022-03-24-22-16-44-560.png!
> This problem really manifested itself when doing something like
> {code:java}
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex"). 
> repartition($"collectionIndex").write.mode("overwrite").partitionBy("collectionIndex").parquet("path")
>  

[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)

2022-03-24 Thread John Engelhart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Engelhart updated SPARK-38653:
---
Description: 
My understanding is that when you call .repartition(a column), the data for 
each unique key in that field goes to its own partition, so two keys should 
never be repartitioned into the same part file. That behavior holds with a 
String column, and it also holds with an Int column except for certain 
numbers; in my use case, the magic numbers are 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-0...").distinct.show
spark.read.parquet("path/part-1...").distinct.show

//Working as expected
spark.read.parquet("path1/part-0...").distinct.show
spark.read.parquet("path1/part-1...").distinct.show
spark.read.parquet("path1/part-2...").distinct.show {code}
!image-2022-03-24-22-09-26-917.png!

  was:
My understanding is when you call .repartition(a column). For each unique key 
in that field. The data will go to that partition. There should never be two 
keys repartitioned the same part. That behavior is true with a String column. 
That behavior is also true with an Int column except on certain numbers. In my 
use case. The magic numbers 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-0...").distinct.show
spark.read.parquet("path/part-1...").distinct.show

//Working as expected
spark.read.parquet("path1/part-0...").distinct.show
spark.read.parquet("path1/part-1...").distinct.show
spark.read.parquet("path1/part-2...").distinct.show {code}
!image-2022-03-24-22-09-26-917.png!


> Repartition by Column that is Int not working properly only on particular 
> numbers. (11, 33)
> ---
>
> Key: SPARK-38653
> URL: https://issues.apache.org/jira/browse/SPARK-38653
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an 
> EMR Notebook writing to S3
>Reporter: John Engelhart
>Priority: Major
>
> My understanding is when you call .repartition(a column). For each unique key 
> in that field. The data will go to that partition. There should never be two 
> keys repartitioned to the same part. That behavior is true with a String 
> column. That behavior is also true with an Int column except on certain 
> numbers. In my use case. The magic numbers 11 and 33.
> {code:java}
> //Int based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> //Produces two part files
> //String based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> //Produces three part files {code}
>  
> {code:java}
> //Not working as expected
> spark.read.parquet("path/part-0...").distinct.show
> spark.read.parquet("path/part-1...").distinct.show
> //Working as expected
> spark.read.parquet("path1/part-0...").distinct.show
> spark.read.parquet("path1/part-1...").distinct.show
> spark.read.parquet("path1/part-2...").distinct.show {code}
> !image-2022-03-24-22-09-26-917.png!






[jira] [Updated] (SPARK-38653) Repartition by Column that is Int not working properly only on particular numbers. (11, 33)

2022-03-24 Thread John Engelhart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Engelhart updated SPARK-38653:
---
Summary: Repartition by Column that is Int not working properly only on 
particular numbers. (11, 33)  (was: Repartition by Column that is Int not 
working properly only particular numbers. (11, 33))

> Repartition by Column that is Int not working properly only on particular 
> numbers. (11, 33)
> ---
>
> Key: SPARK-38653
> URL: https://issues.apache.org/jira/browse/SPARK-38653
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an 
> EMR Notebook writing to S3
>Reporter: John Engelhart
>Priority: Major
>
> My understanding is when you call .repartition(a column). For each unique key 
> in that field. The data will go to that partition. There should never be two 
> keys repartitioned the same part. That behavior is true with a String column. 
> That behavior is also true with an Int column except on certain numbers. In 
> my use case. The magic numbers 11 and 33.
> {code:java}
> //Int based column repartition
> spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path")
> //Produces two part files
> //String based column repartition
> spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
> repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
> //Produces three part files {code}
>  
> {code:java}
> //Not working as expected
> spark.read.parquet("path/part-0...").distinct.show
> spark.read.parquet("path/part-1...").distinct.show
> //Working as expected
> spark.read.parquet("path1/part-0...").distinct.show
> spark.read.parquet("path1/part-1...").distinct.show
> spark.read.parquet("path1/part-2...").distinct.show {code}
> !image-2022-03-24-22-09-26-917.png!






[jira] [Created] (SPARK-38653) Repartition by Column that is Int not working properly only particular numbers. (11, 33)

2022-03-24 Thread John Engelhart (Jira)
John Engelhart created SPARK-38653:
--

 Summary: Repartition by Column that is Int not working properly 
only particular numbers. (11, 33)
 Key: SPARK-38653
 URL: https://issues.apache.org/jira/browse/SPARK-38653
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
 Environment: This was running on EMR 6.4.0 using Spark 3.1.2 in an EMR 
Notebook writing to S3
Reporter: John Engelhart


My understanding is that when you call .repartition(a column), the data for 
each unique key in that field goes to its own partition, so two keys should 
never be repartitioned into the same part file. That behavior holds with a 
String column, and it also holds with an Int column except for certain 
numbers; in my use case, the magic numbers are 11 and 33.
{code:java}
//Int based column repartition
spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path")
//Produces two part files
//String based column repartition
spark.sparkContext.parallelize(Seq("1", "11", "33")).toDF("collectionIndex").
repartition($"collectionIndex").write.mode("overwrite").parquet("path1")
//Produces three part files {code}
 
{code:java}
//Not working as expected
spark.read.parquet("path/part-0...").distinct.show
spark.read.parquet("path/part-1...").distinct.show

//Working as expected
spark.read.parquet("path1/part-0...").distinct.show
spark.read.parquet("path1/part-1...").distinct.show
spark.read.parquet("path1/part-2...").distinct.show {code}
!image-2022-03-24-22-09-26-917.png!
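
For context, `repartition(column)` hash-partitions rows by the column value
(using the spark.sql.shuffle.partitions count, 200 by default), so distinct
keys can legitimately hash into the same partition and therefore the same
part file; one key per file is not guaranteed. A small sketch to see where
each key actually lands (assumes a SparkSession named `spark`, e.g. in
spark-shell; the partition ids will vary with the shuffle-partition count):
{code:java}
// Illustrative only: tag every row with the partition it was shuffled into.
import spark.implicits._
import org.apache.spark.sql.functions.spark_partition_id

spark.sparkContext.parallelize(Seq(1, 11, 33)).toDF("collectionIndex")
  .repartition($"collectionIndex")
  .withColumn("pid", spark_partition_id())
  .orderBy("collectionIndex")
  .show()
// If 11 and 33 report the same pid, they also end up in the same part file,
// which matches the two-file result described above.
{code}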






[jira] [Updated] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2

2022-03-24 Thread qian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qian updated SPARK-38652:
-
Description: 
DepsTestsSuite in the K8s IT tests is blocked by a PathIOException in 
hadoop-aws-3.3.2. The exception message is as follows:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Uploading file 
/Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar
 failed...
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332)

at 
org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277)

at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)   
 
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
   
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275)

at 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187)
   
at scala.collection.immutable.List.foreach(List.scala:431)
at 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178)

at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)  
  
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)  
  
at scala.collection.immutable.List.foldLeft(List.scala:91)
at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84)

at 
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738)
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)

at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)   
 
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) 
   
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
org.apache.spark.SparkException: Error uploading file 
spark-examples_2.12-3.4.0-SNAPSHOT.jar
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355)

at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328)

... 30 more
Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for 
URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar':
 Input/output error
at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365)

at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:226)

at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:170)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920)

at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)

at 

[jira] [Updated] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2

2022-03-24 Thread qian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qian updated SPARK-38652:
-
Description: 
DepsTestsSuite in the K8s IT tests is blocked by a PathIOException in 
hadoop-aws-3.3.2. The exception message is as follows:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Uploading file 
/Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar
 failed...
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332)

at 
org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277)

at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)   
 
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
   
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275)

at 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187)
   
at scala.collection.immutable.List.foreach(List.scala:431)
at 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178)

at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)  
  
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)  
  
at scala.collection.immutable.List.foldLeft(List.scala:91)
at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84)

at 
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738)   
 
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)

at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)   
 
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) 
   
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
org.apache.spark.SparkException: Error uploading file 
spark-examples_2.12-3.4.0-SNAPSHOT.jar
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355)

at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328)

... 30 more
Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for 
URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar':
 Input/output errorat 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365)

at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:226)

at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:170)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920)

at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2337)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.trackDurationAndSpan(S3AFileSystem.java:2356)

at 

[jira] [Commented] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2

2022-03-24 Thread qian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512163#comment-17512163
 ] 

qian commented on SPARK-38652:
--

I am working on it.

cc [~chaosun]  & [~dongjoon] 

> K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
> --
>
> Key: SPARK-38652
> URL: https://issues.apache.org/jira/browse/SPARK-38652
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Priority: Major
>
> DepsTestsSuite in k8s IT test is blocked with PathIOException in 
> hadoop-aws-3.3.2. Exception Message is as follow
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Uploading file 
> /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar
>  failed...
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332)
> 
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277)
> 
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
>
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)   
>  
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)  
>   
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:286)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275)
> 
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187)
>
> at scala.collection.immutable.List.foreach(List.scala:431)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> 
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)   
>  
> at scala.collection.immutable.List.foldLeft(List.scala:91)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84)
> 
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738) 
>
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)
> 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
> 
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) 
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)  
>   
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> org.apache.spark.SparkException: Error uploading file 
> spark-examples_2.12-3.4.0-SNAPSHOT.jar
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355)
> 
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328)
> 
> ... 30 more
> Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path 
> for 
> URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar':
>  Input/output errorat 
> org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365)
> 
> at 
> 

[jira] [Created] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2

2022-03-24 Thread qian (Jira)
qian created SPARK-38652:


 Summary: K8S IT Test DepsTestsSuite blocks with PathIOException in 
hadoop-aws-3.3.2
 Key: SPARK-38652
 URL: https://issues.apache.org/jira/browse/SPARK-38652
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.3.0
Reporter: qian


DepsTestsSuite in the K8s IT tests is blocked by a PathIOException in 
hadoop-aws-3.3.2. The exception message is as follows:
{code:java}
Exception in thread "main" org.apache.spark.SparkException: Uploading file 
/Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar
 failed...
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332)

at 
org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277)

at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)   
 
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
   
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275)

at 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187)
   
at scala.collection.immutable.List.foreach(List.scala:431)
at 
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178)
at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)  
  
at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)  
  
at scala.collection.immutable.List.foldLeft(List.scala:91)
at 
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84)

at 
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738)   
 
at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)

at 
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)

at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)   
 
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046) 
   
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
org.apache.spark.SparkException: Error uploading file 
spark-examples_2.12-3.4.0-SNAPSHOT.jar
at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355)

at 
org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328)

... 30 more
Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path for 
URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar':
 Input/output errorat 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365)

at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.uploadSourceFromFS(CopyFromLocalOperation.java:226)

at 
org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.execute(CopyFromLocalOperation.java:170)

at 
org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$copyFromLocalFile$25(S3AFileSystem.java:3920)

at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:499)
at 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDuration(IOStatisticsBinding.java:444)

at 

[jira] [Created] (SPARK-38651) Writing out empty or nested empty schemas in Datasource should be configurable

2022-03-24 Thread Thejdeep Gudivada (Jira)
Thejdeep Gudivada created SPARK-38651:
-

 Summary: Writing out empty or nested empty schemas in Datasource 
should be configurable
 Key: SPARK-38651
 URL: https://issues.apache.org/jira/browse/SPARK-38651
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Thejdeep Gudivada


In SPARK-23372, we introduced a backwards-incompatible change that removed 
support for writing out empty or nested empty schemas in file-based 
datasources. This breaks users who have been relying on such schemas, which 
the datasources previously accepted; except for Parquet and text, the other 
file-based sources support this behavior.

We should either:
 * Make it configurable to enable/disable writing out empty schemas, or
 * Enable the validation check only for the sources that do not support it 
(Parquet / text).
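
A minimal sketch of the write path in question (the output path and the JSON
format are just examples; the point is the post-SPARK-23372 validation
rejecting the write):
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// A DataFrame whose schema has no fields at all.
val emptySchema = new StructType()
val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], emptySchema)

// Since SPARK-23372 this is rejected for file-based sources; the proposal is
// either a config flag to allow it again, or limiting the check to formats
// that genuinely cannot write it (Parquet, text).
df.write.mode("overwrite").json("/tmp/empty-schema-out")
{code}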






[jira] [Resolved] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal

2022-03-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-38570.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35878
https://github.com/apache/spark/pull/35878

> Incorrect DynamicPartitionPruning caused by Literal
> ---
>
> Key: SPARK-38570
> URL: https://issues.apache.org/jira/browse/SPARK-38570
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: mcdull_zhang
>Priority: Minor
> Fix For: 3.3.0
>
>
> The return value of Literal.references is an empty AttributeSet, so Literal 
> is mistaken for a partition column.
>  
> org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
> {code:java}
> val srcInfo: Option[(Expression, LogicalPlan)] = 
> findExpressionAndTrackLineageDown(a, plan)
> srcInfo.flatMap {
>   case (resExp, l: LogicalRelation) =>
> l.relation match {
>   case fs: HadoopFsRelation =>
> val partitionColumns = AttributeSet(
>   l.resolve(fs.partitionSchema, 
> fs.sparkSession.sessionState.analyzer.resolver))
> // When resExp is a Literal, Literal is considered a partition 
> column.         
> if (resExp.references.subsetOf(partitionColumns)) {
>   return Some(l)
> } else {
>   None
> }
>   case _ => None
> } {code}






[jira] [Assigned] (SPARK-38570) Incorrect DynamicPartitionPruning caused by Literal

2022-03-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-38570:
---

Assignee: mcdull_zhang

> Incorrect DynamicPartitionPruning caused by Literal
> ---
>
> Key: SPARK-38570
> URL: https://issues.apache.org/jira/browse/SPARK-38570
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: mcdull_zhang
>Assignee: mcdull_zhang
>Priority: Minor
>
> The return value of Literal.references is an empty AttributeSet, so Literal 
> is mistaken for a partition column.
>  
> org.apache.spark.sql.execution.dynamicpruning.PartitionPruning#getFilterableTableScan:
> {code:java}
> val srcInfo: Option[(Expression, LogicalPlan)] = 
> findExpressionAndTrackLineageDown(a, plan)
> srcInfo.flatMap {
>   case (resExp, l: LogicalRelation) =>
> l.relation match {
>   case fs: HadoopFsRelation =>
> val partitionColumns = AttributeSet(
>   l.resolve(fs.partitionSchema, 
> fs.sparkSession.sessionState.analyzer.resolver))
> // When resExp is a Literal, Literal is considered a partition 
> column.         
> if (resExp.references.subsetOf(partitionColumns)) {
>   return Some(l)
> } else {
>   None
> }
>   case _ => None
> } {code}






[jira] [Updated] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource

2022-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38645:
-
Fix Version/s: (was: 3.2.1)

> Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen 
> cleanedSource
> --
>
> Key: SPARK-38645
> URL: https://issues.apache.org/jira/browse/SPARK-38645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: tonydoen
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When we use spark-sql and encounter problems in the codegen source, we often 
> have to change the log level to DEBUG, but that mode produces far too many 
> logs.
>  
> A `spark.sql.codegen.cleanedSourcePrint` flag would ensure that just the 
> codegen source is printed.
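
If the flag were added, usage could look roughly like this (note that
`spark.sql.codegen.cleanedSourcePrint` is the configuration proposed by this
ticket, not an existing Spark setting):
{code:java}
// Hypothetical usage of the proposed flag; today the equivalent output only
// shows up when DEBUG logging is enabled for the codegen classes.
spark.conf.set("spark.sql.codegen.cleanedSourcePrint", "true")

// Any codegen-backed query; the cleaned generated source would then be
// printed without switching the whole logger to DEBUG.
spark.range(10).selectExpr("id + 1").collect()
{code}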






[jira] [Commented] (SPARK-38650) Better ParseException message for char without length

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512126#comment-17512126
 ] 

Apache Spark commented on SPARK-38650:
--

User 'anchovYu' has created a pull request for this issue:
https://github.com/apache/spark/pull/35966

> Better ParseException message for char without length
> -
>
> Key: SPARK-38650
> URL: https://issues.apache.org/jira/browse/SPARK-38650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> We support char and varchar types. But when users input the type without 
> length, the message is confusing and not helpful at all:
> {code:sql}
> > SELECT cast('a' as CHAR)
> DataType char is not supported.(line 1, pos 19)
> == SQL ==
> SELECT cast('a' AS CHAR)
> ---^^^{code}
> This ticket would like to improve the error message for these special cases 
> to:
> {code:java}
> Datatype char requires a length parameter, for example char(10). Please 
> specify the length.{code}






[jira] [Assigned] (SPARK-38650) Better ParseException message for char without length

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38650:


Assignee: Apache Spark

> Better ParseException message for char without length
> -
>
> Key: SPARK-38650
> URL: https://issues.apache.org/jira/browse/SPARK-38650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Assignee: Apache Spark
>Priority: Major
>
> We support char and varchar types. But when users input the type without 
> length, the message is confusing and not helpful at all:
> {code:sql}
> > SELECT cast('a' as CHAR)
> DataType char is not supported.(line 1, pos 19)
> == SQL ==
> SELECT cast('a' AS CHAR)
> ---^^^{code}
> This ticket would like to improve the error message for these special cases 
> to:
> {code:java}
> Datatype char requires a length parameter, for example char(10). Please 
> specify the length.{code}






[jira] [Assigned] (SPARK-38650) Better ParseException message for char without length

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38650:


Assignee: (was: Apache Spark)

> Better ParseException message for char without length
> -
>
> Key: SPARK-38650
> URL: https://issues.apache.org/jira/browse/SPARK-38650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinyi Yu
>Priority: Major
>
> We support char and varchar types. But when users input the type without 
> length, the message is confusing and not helpful at all:
> {code:sql}
> > SELECT cast('a' as CHAR)
> DataType char is not supported.(line 1, pos 19)
> == SQL ==
> SELECT cast('a' AS CHAR)
> ---^^^{code}
> This ticket would like to improve the error message for these special cases 
> to:
> {code:java}
> Datatype char requires a length parameter, for example char(10). Please 
> specify the length.{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38650) Better ParseException message for char without length

2022-03-24 Thread Xinyi Yu (Jira)
Xinyi Yu created SPARK-38650:


 Summary: Better ParseException message for char without length
 Key: SPARK-38650
 URL: https://issues.apache.org/jira/browse/SPARK-38650
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Xinyi Yu


We support char and varchar types, but when users input the type without a 
length, the message is confusing and not helpful at all:
{code:sql}
> SELECT cast('a' as CHAR)

DataType char is not supported.(line 1, pos 19)

== SQL ==
SELECT cast('a' AS CHAR)
---^^^{code}
This ticket would like to improve the error message for these special cases to:
{code:java}
Datatype char requires a length parameter, for example char(10). Please specify 
the length.{code}
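
For reference, a minimal sketch of the accepted form once a length is supplied (assuming a running SparkSession named spark; not part of the original report):
{code:java}
// Specifying the length makes the cast parse successfully.
spark.sql("SELECT cast('a' AS CHAR(10))").show()     // fixed-length char
spark.sql("SELECT cast('a' AS VARCHAR(10))").show()  // variable-length, max 10
{code}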



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38641) Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38641.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35956
[https://github.com/apache/spark/pull/35956]

> Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml
> -
>
> Key: SPARK-38641
> URL: https://issues.apache.org/jira/browse/SPARK-38641
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: morvenhuang
>Assignee: morvenhuang
>Priority: Trivial
> Fix For: 3.4.0
>
> Attachments: mvn_scalafmt.jpg
>
>
> After loading the latest Spark code into IntelliJ IDEA, it complains that the 
> configuration elements 'parameters' and 'skip' under the mvn_scalafmt plugin are 
> not allowed; see the attached screenshot for details. 
>  
> I've contacted the author of mvn_scalafmt, Ciaran Kearney, to confirm whether 
> these 2 configuration items are no longer there as of v1.0.0. Here's his reply: 
>  
> {quote}That's correct. The command line parameters were removed by scalafmt 
> itself a few versions ago and skip was replaced by validateOnly (which checks 
> formatting without changing files).
> {quote}
>  
> I think we should get rid of 'parameters' since it's invalid, and 
> replace 'skip' with 'validateOnly' as Ciaran said.
>  
> I can make a quick fix for this.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38641) Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38641:


Assignee: morvenhuang

> Get rid of invalid configuration elements in mvn_scalafmt in main pom.xml
> -
>
> Key: SPARK-38641
> URL: https://issues.apache.org/jira/browse/SPARK-38641
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: morvenhuang
>Assignee: morvenhuang
>Priority: Trivial
> Attachments: mvn_scalafmt.jpg
>
>
> After loading the latest Spark code into IntelliJ IDEA, it complains that the 
> configuration elements 'parameters' and 'skip' under the mvn_scalafmt plugin are 
> not allowed; see the attached screenshot for details. 
>  
> I've contacted the author of mvn_scalafmt, Ciaran Kearney, to confirm whether 
> these 2 configuration items are no longer there as of v1.0.0. Here's his reply: 
>  
> {quote}That's correct. The command line parameters were removed by scalafmt 
> itself a few versions ago and skip was replaced by validateOnly (which checks 
> formatting without changing files).
> {quote}
>  
> I think we should get rid of 'parameters' since it's invalid, and 
> replace 'skip' with 'validateOnly' as Ciaran said.
>  
> I can make a quick fix for this.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-24 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511358#comment-17511358
 ] 

Stu edited comment on SPARK-26639 at 3/24/22, 10:13 PM:


Here's another example of this happening, in Spark 3.1.2. I'm running the 
following code:
{code:java}
WITH t AS (
  SELECT random() as a
) 
  SELECT * FROM t
  UNION
  SELECT * FROM t {code}
The CTE has a non-deterministic function. If it was pre-calculated, the same 
random value would be chosen for `a` in both unioned queries, and the output 
would be deduplicated into a single record.

This is not the case. The output is two records, with different random values.

On our platform, some folks like to write complex CTEs and reference them 
multiple times. Recalculating these for every reference is quite 
computationally expensive, so we recommend creating separate tables in these 
cases, but we don't have any way to enforce this. Fixing this bug would save a 
good number of compute hours!
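
For reference, a minimal sketch of the manual materialization we fall back to today (assuming a running SparkSession named spark; the view name is made up for illustration):
{code:java}
// Materialize the non-deterministic result once, then reference the cached view twice.
val t = spark.sql("SELECT random() AS a").cache()
t.count()  // force materialization so both references below see the same value
t.createOrReplaceTempView("t_mat")
spark.sql("SELECT * FROM t_mat UNION SELECT * FROM t_mat").show()  // one deduplicated row
{code}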


was (Author: stubartmess):
Here's another example of this happening, in Spark 3.1.2. I'm running the 
following code:
{code:java}
WITH t AS (
  SELECT random() as a
) 
  SELECT * FROM t
  UNION
  SELECT * FROM t {code}
The CTE has a non-deterministic function. If it was pre-calculated, the same 
random value would be chosen for `a` in both unioned queries, and the output 
would be deduplicated into a single record.

This is not the case. The output is two records, with different random values.

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan does show the subquery being executed 
> once, but the stage for the same subquery may execute more than once.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38649) Fix SECURITY.md

2022-03-24 Thread Jira
Bjørn Jørgensen created SPARK-38649:
---

 Summary: Fix SECURITY.md
 Key: SPARK-38649
 URL: https://issues.apache.org/jira/browse/SPARK-38649
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 3.4.0
Reporter: Bjørn Jørgensen


At [Github Security -> Security 
policy|https://github.com/apache/spark/security/policy] 
The info there does not tell users what to do if they have found a security 
issue.

The default text for this page is 

 
"
# Security Policy

## Supported Versions

Use this section to tell people about which versions of your project are
currently being supported with security updates.

| Version | Supported  |
| --- | -- |
| 5.1.x   | :white_check_mark: |
| 5.0.x   | :x:|
| 4.0.x   | :white_check_mark: |
| < 4.0   | :x:|

## Reporting a Vulnerability

Use this section to tell people how to report a vulnerability.

Tell them where to go, how often they can expect to get an update on a
reported vulnerability, what to expect if the vulnerability is accepted or
declined, etc.
"

We should change this to something like:

"
Reporting security issues
Apache Spark uses the standard process outlined by the Apache Security Team for 
reporting vulnerabilities. Note that vulnerabilities should not be publicly 
disclosed until the project has responded.

To report a possible security vulnerability, please email 
secur...@spark.apache.org. This is a non-public list that will reach the Apache 
Security team, as well as the Spark PMC.

For more info https://spark.apache.org/security.html 
"
  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38647:


Assignee: (was: Apache Spark)

> Add SupportsReportOrdering mix in interface for Scan
> 
>
> Key: SPARK-38647
> URL: https://issues.apache.org/jira/browse/SPARK-38647
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Priority: Major
>
> As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
> Spark with information about the existing partitioning of the data read by a 
> {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} 
> should provide ordering information.
> This prevents Spark from sorting data if it already exhibits a certain order 
> provided by the source.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511943#comment-17511943
 ] 

Apache Spark commented on SPARK-38647:
--

User 'EnricoMi' has created a pull request for this issue:
https://github.com/apache/spark/pull/35965

> Add SupportsReportOrdering mix in interface for Scan
> 
>
> Key: SPARK-38647
> URL: https://issues.apache.org/jira/browse/SPARK-38647
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Priority: Major
>
> As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
> Spark with information about the existing partitioning of the data read by a 
> {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} 
> should provide ordering information.
> This prevents Spark from sorting data if it already exhibits a certain order 
> provided by the source.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38647:


Assignee: Apache Spark

> Add SupportsReportOrdering mix in interface for Scan
> 
>
> Key: SPARK-38647
> URL: https://issues.apache.org/jira/browse/SPARK-38647
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Assignee: Apache Spark
>Priority: Major
>
> As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
> Spark with information about the existing partitioning of the data read by a 
> {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} 
> should provide ordering information.
> This prevents Spark from sorting data if it already exhibits a certain order 
> provided by the source.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38438) Can't update spark.jars.packages on existing global/default context

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-38438:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

This is not a bug. As discussed in email, this kind of thing raises so many 
questions of semantics when classes are updated or unloaded that I think it 
won't happen.

> Can't update spark.jars.packages on existing global/default context
> ---
>
> Key: SPARK-38438
> URL: https://issues.apache.org/jira/browse/SPARK-38438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: py: 3.9
> spark: 3.2.1
>Reporter: Rafal Wojdyla
>Priority: Minor
>
> Reproduction:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # later on we want to update jars.packages, here's e.g. spark-hats
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # line below returns None, the config was not propagated:
> s._sc._conf.get("spark.jars.packages")
> {code}
> Stopping the context doesn't help, in fact it's even more confusing, because 
> the configuration is updated, but doesn't have an effect:
> {code:python}
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> s.stop()
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # now this line returns 'za.co.absa:spark-hats_2.12:0.2.2', but the context
> # doesn't download the jar/package, as it would if there was no global context
> # thus the extra package is unusable. It's not downloaded, or added to the
> # classpath.
> s._sc._conf.get("spark.jars.packages")
> {code}
> One workaround is to stop the context AND kill the JVM gateway, which seems 
> to be a kind of hard reset:
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import SparkSession
> # default session:
> s = SparkSession.builder.getOrCreate()
> # Hard reset:
> s.stop()
> s._sc._gateway.shutdown()
> s._sc._gateway.proc.stdin.close()
> SparkContext._gateway = None
> SparkContext._jvm = None
> s = (SparkSession.builder
>  .config("spark.jars.packages", "za.co.absa:spark-hats_2.12:0.2.2")
>  .getOrCreate())
> # Now we are guaranteed there's a new spark session, and packages
> # are downloaded, added to the classpath etc.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38346) Add cache in MLlib BinaryClassificationMetrics

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38346.
--
Resolution: Won't Fix

> Add cache in MLlib BinaryClassificationMetrics
> --
>
> Key: SPARK-38346
> URL: https://issues.apache.org/jira/browse/SPARK-38346
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.3
> Environment: Windows10/macOS12.2; spark_2.11-2.2.3; 
> mmlspark_2.11-0.18.0; lightgbmlib-2.2.350
>Reporter: Mingchao Wu
>Priority: Minor
>
> We ran some example code using BinaryClassificationEvaluator in MLlib and found 
> that ShuffledRDD[28] at BinaryClassificationMetrics.scala:155 and 
> UnionRDD[36] at BinaryClassificationMetrics.scala:90 were used more than once 
> but were not cached.
> We use spark-2.2.3, and the code on the master branch is still without 
> caching, so we hope to improve it.
> The example code is as follows:
> {code:java}
> import com.microsoft.ml.spark.lightgbm.LightGBMRegressor
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.feature.VectorAssembler
> import org.apache.spark.sql.types.{DoubleType, IntegerType}
> import org.apache.spark.sql.{DataFrame, SparkSession}
>
> object LightGBMRegressorTest {
>   def main(args: Array[String]): Unit = {
>     val spark: SparkSession = SparkSession.builder()
>       .appName("LightGBMRegressorTest")
>       .master("local[*]")
>       .getOrCreate()
>
>     val startTime = System.currentTimeMillis()
>
>     var originalData: DataFrame = spark.read.option("header", "true")
>       .option("inferSchema", "true")
>       .csv("data/hour.csv")
>
>     val labelCol = "workingday"
>     val cateCols = Array("season", "yr", "mnth", "hr")
>     val conCols: Array[String] = Array("temp", "atemp", "hum", "casual", "cnt")
>     val vecCols = conCols ++ cateCols
>
>     import spark.implicits._
>     vecCols.foreach(col => {
>       originalData = originalData.withColumn(col, $"$col".cast(DoubleType))
>     })
>     originalData = originalData.withColumn(labelCol, $"$labelCol".cast(IntegerType))
>
>     val assembler = new VectorAssembler().setInputCols(vecCols).setOutputCol("features")
>
>     val classifier: LightGBMRegressor = new LightGBMRegressor()
>       .setNumIterations(100).setNumLeaves(31)
>       .setBoostFromAverage(false).setFeatureFraction(1.0).setMaxDepth(-1).setMaxBin(255)
>       .setLearningRate(0.1).setMinSumHessianInLeaf(0.001).setLambdaL1(0.0).setLambdaL2(0.0)
>       .setBaggingFraction(0.5).setBaggingFreq(1).setBaggingSeed(1).setObjective("binary")
>       .setLabelCol(labelCol).setCategoricalSlotNames(cateCols).setFeaturesCol("features")
>       .setBoostingType("gbdt")
>
>     val pipeline: Pipeline = new Pipeline().setStages(Array(assembler, classifier))
>
>     val Array(tr, te) = originalData.randomSplit(Array(0.7, .03), 666)
>     val model = pipeline.fit(tr)
>     val modelDF = model.transform(te)
>
>     val evaluator = new BinaryClassificationEvaluator()
>       .setLabelCol(labelCol).setRawPredictionCol("prediction")
>     println(evaluator.evaluate(modelDF))
>     println(s"time: ${System.currentTimeMillis() - startTime}")
>     System.in.read()
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38210) Spark documentation build README is stale

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38210.
--
Resolution: Not A Problem

> Spark documentation build README is stale
> -
>
> Key: SPARK-38210
> URL: https://issues.apache.org/jira/browse/SPARK-38210
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Khalid Mammadov
>Priority: Minor
>
> I was following docs/README.md to build the documentation and found out that it's 
> not complete. I had to install additional packages that are not documented but are 
> available in the [CI/CD phase 
> |https://github.com/apache/spark/blob/c8b34ab7340265f1f2bec2afa694c10f174b222c/.github/workflows/build_and_test.yml#L526], and
>  a few more, to finish the build process.
> I will file a PR to change README.md to include these packages and improve 
> the guide.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38202) Invalid URL in SparkContext.addedJars will constantly fails Executor.run()

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38202.
--
Resolution: Not A Problem

> Invalid URL in SparkContext.addedJars will constantly fails Executor.run()
> --
>
> Key: SPARK-38202
> URL: https://issues.apache.org/jira/browse/SPARK-38202
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Bo Zhang
>Priority: Major
>
> When an invalid URL is used in SparkContext.addJar(), all subsequent query 
> executions will fail, since downloading the jar is in the critical path of 
> Executor.run(), even when the query has nothing to do with the jar. 
> A simple reproduction of the issue:
> {code:java}
> sc.addJar("http://invalid/library.jar")
> (0 to 1).toDF.count
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37532) RDD name could be very long and memory costly

2022-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-37532.
--
Resolution: Won't Fix

> RDD name could be very long and memory costly
> -
>
> Key: SPARK-37532
> URL: https://issues.apache.org/jira/browse/SPARK-37532
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Kent Yao
>Priority: Minor
>
> Take sc.newHadoopFile as an example: the path parameter can be a very long 
> string and turns into a very unfriendly name for both the UI and driver memory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38648) SPIP: Simplified API for DL Inferencing

2022-03-24 Thread Lee Yang (Jira)
Lee Yang created SPARK-38648:


 Summary: SPIP: Simplified API for DL Inferencing
 Key: SPARK-38648
 URL: https://issues.apache.org/jira/browse/SPARK-38648
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Lee Yang


h1. Background and Motivation

The deployment of deep learning (DL) models to Spark clusters can be a point of 
friction today.  DL practitioners often aren't well-versed with Spark, and 
Spark experts often aren't well-versed with the fast-changing DL frameworks.  
Currently, the deployment of trained DL models is done in a fairly ad-hoc 
manner, with each model integration usually requiring significant effort.

To simplify this process, we propose adding an integration layer for each major 
DL framework that can introspect their respective saved models to more easily 
integrate these models into Spark applications.  You can find a detailed 
proposal 
[here|https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0]

h1. Goals

- Simplify the deployment of trained single-node DL models to Spark inference 
applications.
- Follow pandas_udf for simple inference use-cases.
- Follow Spark ML Pipelines APIs for transfer-learning use-cases.
- Enable integrations with popular third-party DL frameworks like TensorFlow, 
PyTorch, and Huggingface.
- Focus on PySpark, since most of the DL frameworks use Python.
- Take advantage of built-in Spark features like GPU scheduling and Arrow 
integration.
- Enable inference on both CPU and GPU.

h1. Non-goals

- DL model training.
- Inference w/ distributed models, i.e. "model parallel" inference.

h1. Target Personas

- Data scientists who need to deploy DL models on Spark.
- Developers who need to deploy DL models on Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan

2022-03-24 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-38647:
--
Description: 
As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
Spark with information about the existing partitioning of the data read by a 
{{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} 
should provide ordering information.

This prevents Spark from sorting data if it already exhibits a certain order 
provided by the source.

  was:
As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
Spark with information about the existing partitioning of the data read by a 
{{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} should 
provide ordering information.

This prevents Spark from sorting data if it exhibits a certain order provided 
by the source.


> Add SupportsReportOrdering mix in interface for Scan
> 
>
> Key: SPARK-38647
> URL: https://issues.apache.org/jira/browse/SPARK-38647
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Priority: Major
>
> As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
> Spark with information about the existing partitioning of the data read by a 
> {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} 
> should provide ordering information.
> This prevents Spark from sorting data if it already exhibits a certain order 
> provided by the source.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan

2022-03-24 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-38647:
-

 Summary: Add SupportsReportOrdering mix in interface for Scan
 Key: SPARK-38647
 URL: https://issues.apache.org/jira/browse/SPARK-38647
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.0
Reporter: Enrico Minack


As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
Spark with information about the existing partitioning of the data read by a 
{{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} should 
provide ordering information.

This prevents Spark from sorting data if it exhibits a certain order provided 
by the source.
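
For illustration only, a rough sketch of what such a mix-in could look like, mirroring {{SupportsReportPartitioning}} (the method name and types here are assumptions, not a released API):
{code:java}
import org.apache.spark.sql.connector.expressions.SortOrder
import org.apache.spark.sql.connector.read.Scan

// Hypothetical shape of the proposed mix-in: a Scan declares the ordering its
// output already satisfies, so Spark can skip redundant sorts.
trait SupportsReportOrdering extends Scan {
  def outputOrdering(): Array[SortOrder]
}
{code}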



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses int64

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37463.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34984
[https://github.com/apache/spark/pull/34984]

> Read/Write Timestamp ntz from/to Orc uses int64
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +---+---+---+
> Time zone is UTC
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +---+---+---+
> Time zone is Europe/Amsterdam
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +---+---+---+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37463) Read/Write Timestamp ntz from/to Orc uses int64

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37463:
---

Assignee: jiaan.geng

> Read/Write Timestamp ntz from/to Orc uses int64
> ---
>
> Key: SPARK-37463
> URL: https://issues.apache.org/jira/browse/SPARK-37463
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Here is some example code:
> import java.util.TimeZone
> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
> sql("set spark.sql.session.timeZone=America/Los_Angeles")
> val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp 
> '2021-06-01 00:00:00' ts")
> df.write.mode("overwrite").orc("ts_ntz_orc")
> df.write.mode("overwrite").parquet("ts_ntz_parquet")
> df.write.mode("overwrite").format("avro").save("ts_ntz_avro")
> val query = """
>   select 'orc', *
>   from `orc`.`ts_ntz_orc`
>   union all
>   select 'parquet', *
>   from `parquet`.`ts_ntz_parquet`
>   union all
>   select 'avro', *
>   from `avro`.`ts_ntz_avro`
> """
> val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam")
> for (tz <- tzs) {
>   TimeZone.setDefault(TimeZone.getTimeZone(tz))
>   sql(s"set spark.sql.session.timeZone=$tz")
>   println(s"Time zone is ${TimeZone.getDefault.getID}")
>   sql(query).show(false)
> }
> The output shown below looks strange.
> Time zone is America/Los_Angeles
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 00:00:00|
> +---+---+---+
> Time zone is UTC
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 17:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 07:00:00|
> +---+---+---+
> Time zone is Europe/Amsterdam
> +---+---+---+
> |orc|ts_ntz |ts |
> +---+---+---+
> |orc|2021-05-31 15:00:00|2021-06-01 00:00:00|
> |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00|
> |avro   |2021-06-01 00:00:00|2021-06-01 09:00:00|
> +---+---+---+



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38646:


Assignee: (was: Apache Spark)

> Pull a trait out for Python functions
> -
>
> Key: SPARK-38646
> URL: https://issues.apache.org/jira/browse/SPARK-38646
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Zhen Li
>Priority: Major
>
> Currently, PySpark uses a case class, PythonFunction, in PythonRDD and many other 
> interfaces/classes. We propose to change this to a trait instead, to avoid tying 
> the implementation to the APIs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511886#comment-17511886
 ] 

Apache Spark commented on SPARK-38646:
--

User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/35964

> Pull a trait out for Python functions
> -
>
> Key: SPARK-38646
> URL: https://issues.apache.org/jira/browse/SPARK-38646
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Zhen Li
>Priority: Major
>
> Currently, PySpark uses a case class, PythonFunction, in PythonRDD and many other 
> interfaces/classes. We propose to change this to a trait instead, to avoid tying 
> the implementation to the APIs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511887#comment-17511887
 ] 

Apache Spark commented on SPARK-38646:
--

User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/35964

> Pull a trait out for Python functions
> -
>
> Key: SPARK-38646
> URL: https://issues.apache.org/jira/browse/SPARK-38646
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Zhen Li
>Priority: Major
>
> Currently, PySpark uses a case class, PythonFunction, in PythonRDD and many other 
> interfaces/classes. We propose to change this to a trait instead, to avoid tying 
> the implementation to the APIs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38646:


Assignee: Apache Spark

> Pull a trait out for Python functions
> -
>
> Key: SPARK-38646
> URL: https://issues.apache.org/jira/browse/SPARK-38646
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.2.2
>Reporter: Zhen Li
>Assignee: Apache Spark
>Priority: Major
>
> Currently, PySpark uses a case class, PythonFunction, in PythonRDD and many other 
> interfaces/classes. We propose to change this to a trait instead, to avoid tying 
> the implementation to the APIs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2022-03-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-37568:
-
Fix Version/s: 3.3.0

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38646) Pull a trait out for Python functions

2022-03-24 Thread Zhen Li (Jira)
Zhen Li created SPARK-38646:
---

 Summary: Pull a trait out for Python functions
 Key: SPARK-38646
 URL: https://issues.apache.org/jira/browse/SPARK-38646
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0, 3.2.2
Reporter: Zhen Li


Currently, PySpark uses a case class, PythonFunction, in PythonRDD and many other 
interfaces/classes. We propose to change this to a trait instead, to avoid tying 
the implementation to the APIs.
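
A minimal, hypothetical sketch of the idea (field names are simplified for illustration and are not the actual Spark signatures):
{code:java}
// Expose a trait as the public shape and keep a concrete case class as one implementation.
trait PythonFunction {
  def command: Array[Byte]
  def pythonVer: String
}

case class SimplePythonFunction(
    command: Array[Byte],
    pythonVer: String) extends PythonFunction
{code}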



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38639) Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511874#comment-17511874
 ] 

Apache Spark commented on SPARK-38639:
--

User 'TonyDoen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35963

> Support ignoreCorruptRecord flag to ensure querying broken sequence file 
> table smoothly
> ---
>
> Key: SPARK-38639
> URL: https://issues.apache.org/jira/browse/SPARK-38639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: tonydoen
>Priority: Minor
> Fix For: 3.2.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> There are existing flags "spark.sql.files.ignoreCorruptFiles" and 
> "spark.sql.files.ignoreMissingFiles" that will quietly ignore attempted reads 
> from files that have been corrupted, but they still allow the query to fail on 
> sequence files.
>  
> Being able to ignore corrupt records is useful in scenarios where users 
> want to query dirty data (e.g. mixed schemas in one table) successfully.
>  
> We would like to add "spark.sql.hive.ignoreCorruptRecord" to fill out the 
> functionality.
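
A hedged usage sketch of the proposed flag (the config name comes from this proposal, not from a released Spark version, and the table name is made up):
{code:java}
// Skip unreadable records in a corrupt SequenceFile-backed Hive table instead of failing the query.
spark.conf.set("spark.sql.hive.ignoreCorruptRecord", "true")
spark.sql("SELECT count(*) FROM corrupt_seqfile_table").show()
{code}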



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38639) Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511873#comment-17511873
 ] 

Apache Spark commented on SPARK-38639:
--

User 'TonyDoen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35963

> Support ignoreCorruptRecord flag to ensure querying broken sequence file 
> table smoothly
> ---
>
> Key: SPARK-38639
> URL: https://issues.apache.org/jira/browse/SPARK-38639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: tonydoen
>Priority: Minor
> Fix For: 3.2.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> There are existing flags "spark.sql.files.ignoreCorruptFiles" and 
> "spark.sql.files.ignoreMissingFiles" that will quietly ignore attempted reads 
> from files that have been corrupted, but they still allow the query to fail on 
> sequence files.
>  
> Being able to ignore corrupt records is useful in scenarios where users 
> want to query dirty data (e.g. mixed schemas in one table) successfully.
>  
> We would like to add "spark.sql.hive.ignoreCorruptRecord" to fill out the 
> functionality.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource

2022-03-24 Thread tonydoen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511858#comment-17511858
 ] 

tonydoen commented on SPARK-38645:
--

related pr : [https://github.com/apache/spark/pull/35962]

> Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen 
> cleanedSource
> --
>
> Key: SPARK-38645
> URL: https://issues.apache.org/jira/browse/SPARK-38645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: tonydoen
>Priority: Trivial
> Fix For: 3.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When we use spark-sql and encounter problems in the generated (codegen) source, 
> we often have to change the log level to DEBUG, but there are too many logs in 
> that mode.
>  
> A `spark.sql.codegen.cleanedSourcePrint` flag would make it possible to print 
> just the cleaned codegen source.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38639) Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511855#comment-17511855
 ] 

Apache Spark commented on SPARK-38639:
--

User 'TonyDoen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35962

> Support ignoreCorruptRecord flag to ensure querying broken sequence file 
> table smoothly
> ---
>
> Key: SPARK-38639
> URL: https://issues.apache.org/jira/browse/SPARK-38639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: tonydoen
>Priority: Minor
> Fix For: 3.2.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> There are existing flags "spark.sql.files.ignoreCorruptFiles" and 
> "spark.sql.files.ignoreMissingFiles" that will quietly ignore attempted reads 
> from files that have been corrupted, but they still allow the query to fail on 
> sequence files.
>  
> Being able to ignore corrupt records is useful in scenarios where users 
> want to query dirty data (e.g. mixed schemas in one table) successfully.
>  
> We would like to add "spark.sql.hive.ignoreCorruptRecord" to fill out the 
> functionality.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38639) Support ignoreCorruptRecord flag to ensure querying broken sequence file table smoothly

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511854#comment-17511854
 ] 

Apache Spark commented on SPARK-38639:
--

User 'TonyDoen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35962

> Support ignoreCorruptRecord flag to ensure querying broken sequence file 
> table smoothly
> ---
>
> Key: SPARK-38639
> URL: https://issues.apache.org/jira/browse/SPARK-38639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: tonydoen
>Priority: Minor
> Fix For: 3.2.1
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> There are existing flags "spark.sql.files.ignoreCorruptFiles" and 
> "spark.sql.files.ignoreMissingFiles" that will quietly ignore attempted reads 
> from files that have been corrupted, but they still allow the query to fail on 
> sequence files.
>  
> Being able to ignore corrupt records is useful in scenarios where users 
> want to query dirty data (e.g. mixed schemas in one table) successfully.
>  
> We would like to add "spark.sql.hive.ignoreCorruptRecord" to fill out the 
> functionality.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource

2022-03-24 Thread tonydoen (Jira)
tonydoen created SPARK-38645:


 Summary: Support `spark.sql.codegen.cleanedSourcePrint` flag to 
print Codegen cleanedSource
 Key: SPARK-38645
 URL: https://issues.apache.org/jira/browse/SPARK-38645
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.1
Reporter: tonydoen
 Fix For: 3.2.1


When we use spark-sql and encounter problems in the generated (codegen) source, we often have 
to change the log level to DEBUG, but there are too many logs in that mode.

A `spark.sql.codegen.cleanedSourcePrint` flag would make it possible to print 
just the cleaned codegen source.
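
A hedged usage sketch (the config name comes from this proposal, not from a released Spark version):
{code:java}
// Print only the cleaned generated source for compiled stages, without enabling DEBUG logging.
spark.conf.set("spark.sql.codegen.cleanedSourcePrint", "true")
spark.range(10).selectExpr("id * 2 AS doubled").collect()
{code}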



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38644:


Assignee: (was: Apache Spark)

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511846#comment-17511846
 ] 

Apache Spark commented on SPARK-38644:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35961

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38644:


Assignee: Apache Spark

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-24 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-38644:
---
Summary: DS V2 topN push-down supports project with alias  (was: DS V2 
aggregate push-down supports project with alias)

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38644) DS V2 aggregate push-down supports project with alias

2022-03-24 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-38644:
---
Issue Type: Improvement  (was: New Feature)

> DS V2 aggregate push-down supports project with alias
> -
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38644) DS V2 aggregate push-down supports project with alias

2022-03-24 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-38644:
--

 Summary: DS V2 aggregate push-down supports project with alias
 Key: SPARK-38644
 URL: https://issues.apache.org/jira/browse/SPARK-38644
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38461) Use error classes in org.apache.spark.broadcast

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38461:


Assignee: (was: Apache Spark)

> Use error classes in org.apache.spark.broadcast
> ---
>
> Key: SPARK-38461
> URL: https://issues.apache.org/jira/browse/SPARK-38461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38461) Use error classes in org.apache.spark.broadcast

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38461:


Assignee: Apache Spark

> Use error classes in org.apache.spark.broadcast
> ---
>
> Key: SPARK-38461
> URL: https://issues.apache.org/jira/browse/SPARK-38461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38461) Use error classes in org.apache.spark.broadcast

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511834#comment-17511834
 ] 

Apache Spark commented on SPARK-38461:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35960

> Use error classes in org.apache.spark.broadcast
> ---
>
> Key: SPARK-38461
> URL: https://issues.apache.org/jira/browse/SPARK-38461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2022-03-24 Thread hujiahua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511815#comment-17511815
 ] 

hujiahua commented on SPARK-18105:
--

It works in my case after setting spark.file.transferTo=false. Thanks to 
[~zhangweilst]. My Spark version was 3.1.2.
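
For reference, a minimal sketch of applying that workaround when building the session (assuming the standard SparkSession builder):
{code:java}
import org.apache.spark.sql.SparkSession

// Disable NIO transferTo during shuffle file copying, as suggested above.
val spark = SparkSession.builder()
  .config("spark.file.transferTo", "false")
  .getOrCreate()
{code}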

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Davies Liu
>Priority: Major
> Attachments: TestWeightedGraph.java
>
>
> When lz4 is used to compress the shuffle files, decompression may fail with 
> "Stream is corrupted":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>   at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>   at java.io.DataInputStream.read(DataInputStream.java:149)
>   at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>   at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511809#comment-17511809
 ] 

Apache Spark commented on SPARK-38640:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/35959

> NPE with unpersisting memory-only RDD with RDD fetching from shuffle service 
> enabled
> 
>
> Key: SPARK-38640
> URL: https://issues.apache.org/jira/browse/SPARK-38640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Priority: Major
>
> If you have RDD fetching from shuffle service enabled, memory-only cached 
> RDDs will fail to unpersist.
>  
>  
> {code:java}
> // spark.shuffle.service.fetch.rdd.enabled=true
> val df = spark.range(5)
>   .persist(StorageLevel.MEMORY_ONLY)
> df.count()
> df.unpersist(true)
> {code}
>  
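
For context, a minimal sketch (not from the report) of the configuration this 
refers to; an external shuffle service must actually be running for these 
settings to take effect:

{code}
# Hypothetical sketch of the configuration referenced above.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Serve shuffle blocks through the external shuffle service...
    .config("spark.shuffle.service.enabled", "true")
    # ...and also let it serve persisted RDD blocks (the setting in the repro).
    .config("spark.shuffle.service.fetch.rdd.enabled", "true")
    .getOrCreate()
)

df = spark.range(5).persist(StorageLevel.MEMORY_ONLY)
df.count()
df.unpersist(True)  # reported to hit an NPE on affected versions
{code}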



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38640:


Assignee: (was: Apache Spark)

> NPE with unpersisting memory-only RDD with RDD fetching from shuffle service 
> enabled
> 
>
> Key: SPARK-38640
> URL: https://issues.apache.org/jira/browse/SPARK-38640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Priority: Major
>
> If you have RDD fetching from shuffle service enabled, memory-only cached 
> RDDs will fail to unpersist.
>  
>  
> {code:java}
> // spark.shuffle.service.fetch.rdd.enabled=true
> val df = spark.range(5)
>   .persist(StorageLevel.MEMORY_ONLY)
> df.count()
> df.unpersist(true)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511810#comment-17511810
 ] 

Apache Spark commented on SPARK-38640:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/35959

> NPE with unpersisting memory-only RDD with RDD fetching from shuffle service 
> enabled
> 
>
> Key: SPARK-38640
> URL: https://issues.apache.org/jira/browse/SPARK-38640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Priority: Major
>
> If you have RDD fetching from shuffle service enabled, memory-only cached 
> RDDs will fail to unpersist.
>  
>  
> {code:java}
> // spark.shuffle.service.fetch.rdd.enabled=true
> val df = spark.range(5)
>   .persist(StorageLevel.MEMORY_ONLY)
> df.count()
> df.unpersist(true)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38640) NPE with unpersisting memory-only RDD with RDD fetching from shuffle service enabled

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38640:


Assignee: Apache Spark

> NPE with unpersisting memory-only RDD with RDD fetching from shuffle service 
> enabled
> 
>
> Key: SPARK-38640
> URL: https://issues.apache.org/jira/browse/SPARK-38640
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Adam Binford
>Assignee: Apache Spark
>Priority: Major
>
> If you have RDD fetching from shuffle service enabled, memory-only cached 
> RDDs will fail to unpersist.
>  
>  
> {code:java}
> // spark.shuffle.service.fetch.rdd.enabled=true
> val df = spark.range(5)
>   .persist(StorageLevel.MEMORY_ONLY)
> df.count()
> df.unpersist(true)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series

2022-03-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511790#comment-17511790
 ] 

Hyukjin Kwon commented on SPARK-38627:
--

I used a Mac. I haven't tested it on Spark 3.2. Can you show the full error message?

> TypeError: Datetime subtraction can only be applied to datetime series
> --
>
> Key: SPARK-38627
> URL: https://issues.apache.org/jira/browse/SPARK-38627
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas. When I tried 
> this (pdf is a pyspark.pandas DataFrame):
> {code:java}
> pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code}
> I got the error below:
> {code:java}
> File 
> "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py",
>  line 75, in sub
> raise TypeError("Datetime subtraction can only be applied to datetime 
> series.") {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38636) AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'

2022-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38636.
--
Resolution: Not A Problem

> AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'
> 
>
> Key: SPARK-38636
> URL: https://issues.apache.org/jira/browse/SPARK-38636
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas.
> I tried something like the following:
> {code:java}
> List[pd.Timestamp] {code}
> But it does not work and instead throws the error below:
> {code:java}
> AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38643) Validate input dataset of ml.regression

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511771#comment-17511771
 ] 

Apache Spark commented on SPARK-38643:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35958

> Validate input dataset of ml.regression
> ---
>
> Key: SPARK-38643
> URL: https://issues.apache.org/jira/browse/SPARK-38643
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38643) Validate input dataset of ml.regression

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38643:


Assignee: (was: Apache Spark)

> Validate input dataset of ml.regression
> ---
>
> Key: SPARK-38643
> URL: https://issues.apache.org/jira/browse/SPARK-38643
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38643) Validate input dataset of ml.regression

2022-03-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38643:


Assignee: Apache Spark

> Validate input dataset of ml.regression
> ---
>
> Key: SPARK-38643
> URL: https://issues.apache.org/jira/browse/SPARK-38643
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38643) Validate input dataset of ml.regression

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511770#comment-17511770
 ] 

Apache Spark commented on SPARK-38643:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/35958

> Validate input dataset of ml.regression
> ---
>
> Key: SPARK-38643
> URL: https://issues.apache.org/jira/browse/SPARK-38643
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38643) Validate input dataset of ml.regression

2022-03-24 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-38643:


 Summary: Validate input dataset of ml.regression
 Key: SPARK-38643
 URL: https://issues.apache.org/jira/browse/SPARK-38643
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.4.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series

2022-03-24 Thread Prakhar Sandhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511765#comment-17511765
 ] 

Prakhar Sandhu commented on SPARK-38627:


Hi [~hyukjin.kwon] , great ^^
 # Did it work on Spark 3.3 or Spark 3.2?
 # What environment are you using?

I have set up a conda environment on my local system with Spark 3.2.

I called to_numpy() explicitly but got the error below:

{code:java}
 df = pd.DataFrame({ 'Date1': rng.to_numpy(),  'Date2': rng.to_numpy()})
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\indexes\base.py",
 line 519, in to_numpy     
    result = np.asarray(self._to_internal_pandas()._values, dtype=dtype)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\indexes\base.py",
 line 472, in _to_internal_pandas
    return self._psdf._internal.to_pandas_frame.index{code}

> TypeError: Datetime subtraction can only be applied to datetime series
> --
>
> Key: SPARK-38627
> URL: https://issues.apache.org/jira/browse/SPARK-38627
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas. When I tried 
> this (pdf is a pyspark.pandas DataFrame):
> {code:java}
> pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code}
> I got the error below:
> {code:java}
> File 
> "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py",
>  line 75, in sub
> raise TypeError("Datetime subtraction can only be applied to datetime 
> series.") {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series

2022-03-24 Thread Prakhar Sandhu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38627 ]


Prakhar Sandhu deleted comment on SPARK-38627:


was (Author: JIRAUSER286645):
Hi [~hyukjin.kwon] , nice ^^
 # Did it work on Spark 3.3?
 # What environment are you using?

I have set up a conda environment on my local system with Spark 3.2.

I used to_numpy explicitly but got this error:

{code:java}
 df = pd.DataFrame({ 'Date1': rng.to_numpy,  'Date2': rng.to_numpy})
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\frame.py", 
line 519, in __init__
    pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, 
copy=copy)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\frame.py", line 
435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py",
 line 254, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py",
 line 64, in arrays_to_mgr
    index = extract_index(arrays)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py",
 line 355, in extract_index
    raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index {code}

> TypeError: Datetime subtraction can only be applied to datetime series
> --
>
> Key: SPARK-38627
> URL: https://issues.apache.org/jira/browse/SPARK-38627
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas. When I tried 
> this (pdf is a pyspark.pandas DataFrame):
> {code:java}
> pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code}
> I got the error below:
> {code:java}
> File 
> "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py",
>  line 75, in sub
> raise TypeError("Datetime subtraction can only be applied to datetime 
> series.") {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series

2022-03-24 Thread Prakhar Sandhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511763#comment-17511763
 ] 

Prakhar Sandhu commented on SPARK-38627:


Hi [~hyukjin.kwon] , nice ^^
 # Did it work on Spark 3.3?
 # What environment are you using?

I have set up a conda environment on my local system with Spark 3.2.

I used to_numpy explicitly but got this error:

{code:java}
 df = pd.DataFrame({ 'Date1': rng.to_numpy,  'Date2': rng.to_numpy})
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pyspark\pandas\frame.py", 
line 519, in __init__
    pdf = pd.DataFrame(data=data, index=index, columns=columns, dtype=dtype, 
copy=copy)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\frame.py", line 
435, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py",
 line 254, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py",
 line 64, in arrays_to_mgr
    index = extract_index(arrays)
  File 
"C:\Users\abc\Anaconda3\envs\env2\lib\site-packages\pandas\core\internals\construction.py",
 line 355, in extract_index
    raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index {code}

> TypeError: Datetime subtraction can only be applied to datetime series
> --
>
> Key: SPARK-38627
> URL: https://issues.apache.org/jira/browse/SPARK-38627
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas. When I tried 
> this (pdf is a pyspark.pandas DataFrame):
> {code:java}
> pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code}
> I got the error below:
> {code:java}
> File 
> "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py",
>  line 75, in sub
> raise TypeError("Datetime subtraction can only be applied to datetime 
> series.") {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series

2022-03-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511752#comment-17511752
 ] 

Hyukjin Kwon commented on SPARK-38627:
--

^^ this works, although you have to call to_numpy() explicitly when creating the 
DataFrame:

{code}
import pyspark.pandas as pd
import numpy as np

np.random.seed(0)

rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date1': rng.to_numpy(),  'Date2': rng.to_numpy()}) 

print(df)

df["x"] = df["Date1"] - df["Date2"]

print(df) 
{code}

{code}
Date1   Date2  x
0 2015-02-24 00:00:00 2015-02-24 00:00:00  0
1 2015-02-24 00:01:00 2015-02-24 00:01:00  0
2 2015-02-24 00:02:00 2015-02-24 00:02:00  0
3 2015-02-24 00:03:00 2015-02-24 00:03:00  0
4 2015-02-24 00:04:00 2015-02-24 00:04:00  0
{code}
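
For the 30-day bucketing in the original report, a minimal sketch under the 
assumption that pandas-on-Spark returns the difference between two datetime 
series as whole seconds (as the integer column above suggests), so the division 
uses the number of seconds in 30 days rather than a Timedelta:

{code}
# Hypothetical follow-up to the snippet above; not from the original comment.
import pyspark.pandas as ps

rng = ps.date_range('2015-02-24', periods=5, freq='T')
df = ps.DataFrame({'Date1': rng.to_numpy(), 'Date2': rng.to_numpy()})

# The subtraction appears to yield whole seconds, so divide by the
# number of seconds in 30 days to get a 30-day difference.
seconds_per_30_days = 30 * 24 * 60 * 60
df["date_diff"] = (df["Date1"] - df["Date2"]) / seconds_per_30_days
print(df)
{code}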

> TypeError: Datetime subtraction can only be applied to datetime series
> --
>
> Key: SPARK-38627
> URL: https://issues.apache.org/jira/browse/SPARK-38627
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas. When I tried 
> this (pdf is a pyspark.pandas DataFrame):
> {code:java}
> pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code}
> I got the error below:
> {code:java}
> File 
> "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py",
>  line 75, in sub
> raise TypeError("Datetime subtraction can only be applied to datetime 
> series.") {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38636) AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'

2022-03-24 Thread Prakhar Sandhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511751#comment-17511751
 ] 

Prakhar Sandhu commented on SPARK-38636:


Hi [~hyukjin.kwon] , 

I was able to get past the above error by replacing pd.Timestamp with 
pd.to_datetime:
{code:java}
import pyspark.pandas as pd
List[pd.to_datetime]{code}
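
An alternative sketch (my assumption, not part of the original comment): since 
pyspark.pandas does not re-export Timestamp, the plain pandas class can still be 
used for type hints while pyspark.pandas handles the data itself:

{code}
# Hypothetical sketch: keep pandas only for the Timestamp type hint.
from typing import List

import pandas
import pyspark.pandas as ps

def to_series(values: List[pandas.Timestamp]) -> ps.Series:
    # Build a pandas-on-Spark series from plain pandas timestamps.
    return ps.Series(values)
{code}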

> AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'
> 
>
> Key: SPARK-38636
> URL: https://issues.apache.org/jira/browse/SPARK-38636
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas.
> I tried something like the following:
> {code:java}
> List[pd.Timestamp] {code}
> But it does not work and instead throws the error below:
> {code:java}
> AttributeError: module 'pyspark.pandas' has no attribute 'Timestamp'{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38627) TypeError: Datetime subtraction can only be applied to datetime series

2022-03-24 Thread Prakhar Sandhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511748#comment-17511748
 ] 

Prakhar Sandhu commented on SPARK-38627:


Hi [~hyukjin.kwon] , 

I am not sure I can share the full repo code, but please try running the 
commands below on Spark 3.3.

The code snippet below runs fine with the pandas library but fails when I 
replace pandas with pyspark.pandas:

{code:java}
import pyspark.pandas as pd
import numpy as np

np.random.seed(0)

rng = pd.date_range('2015-02-24', periods=5, freq='T')
df = pd.DataFrame({ 'Date1': rng,  'Date2': rng}) 

print(df)

df["x"] = df["Date1"] - df["Date2"]

print(df) {code}

> TypeError: Datetime subtraction can only be applied to datetime series
> --
>
> Key: SPARK-38627
> URL: https://issues.apache.org/jira/browse/SPARK-38627
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Prakhar Sandhu
>Priority: Major
>
> I am trying to replace the pandas library with pyspark.pandas. When I tried 
> this (pdf is a pyspark.pandas DataFrame):
> {code:java}
> pdf["date_diff"] = (pdf["date1"] - pdf["date2"])/pdf.Timedelta(days=30){code}
> I got the error below:
> {code:java}
> File 
> "C:\Users\abc\Anaconda3\envs\test\lib\site-packages\pyspark\pandas\data_type_ops\datetime_ops.py",
>  line 75, in sub
> raise TypeError("Datetime subtraction can only be applied to datetime 
> series.") {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25789) Support for Dataset of Avro

2022-03-24 Thread IKozar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511747#comment-17511747
 ] 

IKozar commented on SPARK-25789:


My development is also blocked by this issue. What is the expected timeline for 
a fix?

> Support for Dataset of Avro
> ---
>
> Key: SPARK-25789
> URL: https://issues.apache.org/jira/browse/SPARK-25789
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Aleksander Eskilson
>Priority: Major
>
> Support for Dataset of Avro records in an API that would allow the user to 
> provide a class to an {{Encoder}} for Avro, analogous to the {{Bean}} 
> encoder. This functionality was previously to be provided by SPARK-22739 and 
> [Spark-Avro #169|https://github.com/databricks/spark-avro/issues/169]. Avro 
> functionality was folded into Spark-proper by SPARK-24768, eliminating the 
> need to maintain a separate library for Avro in Spark. Resolution of this 
> issue would:
>  * Add necessary {{Expression}} elements to Spark
>  * Add an {{AvroEncoder}} for Datasets of Avro records to Spark



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-03-24 Thread ZygD (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZygD updated SPARK-38614:
-
Summary: After Spark update, df.show() shows incorrect F.percent_rank 
results  (was: df.show() shows incorrect F.percent_rank results)

> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1
>Reporter: ZygD
>Priority: Major
>  Labels: correctness
>
> The expected result is obtained with Spark 3.1.2 but not with 3.2.0 or 3.2.1.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5) {code}
> *Expected result*
> {code:java}
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---++
> only showing top 3 rows
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---++
> only showing top 5 rows{code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows{code}
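
A hedged diagnostic sketch (not from the report): collecting the full result 
first shows the ranks computed over all 101 rows, independent of how many rows 
show() fetches:

{code}
# Hypothetical check based on the reproduction above.
from pyspark.sql import SparkSession, functions as F, Window as W

spark = SparkSession.builder.getOrCreate()
df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))

# Ranks over the full dataset; id=1 should map to pr=0.01 here.
full = {row['id']: row['pr'] for row in df.collect()}
print([full[i] for i in range(5)])
{code}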



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511693#comment-17511693
 ] 

Apache Spark commented on SPARK-37568:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35957

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz
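
Illustrative only: a sketch of the two-argument form described above, assuming 
it is available (Spark 3.4 per the fix version); the time zone names are just 
examples:

{code}
# Hypothetical sketch of the 2-argument convert_timezone() described above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Used as sourceTz when the first argument is omitted.
    .config("spark.sql.session.timeZone", "America/Los_Angeles")
    .getOrCreate()
)

# With two arguments, the source time zone defaults to the session time zone.
spark.sql(
    "SELECT convert_timezone('UTC', TIMESTAMP_NTZ '2022-03-24 00:00:00')"
).show(truncate=False)
{code}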



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2022-03-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17511692#comment-17511692
 ] 

Apache Spark commented on SPARK-37568:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/35957

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38642) spark-sql can not enable isolatedClientLoader to extend dsv2 catalog when using builtin hiveMetastoreJar

2022-03-24 Thread suheng.cloud (Jira)
suheng.cloud created SPARK-38642:


 Summary: spark-sql can not enable isolatedClientLoader to extend 
dsv2 catalog when using builtin hiveMetastoreJar
 Key: SPARK-38642
 URL: https://issues.apache.org/jira/browse/SPARK-38642
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.1.2
Reporter: suheng.cloud


Hi all,

I use IsolatedClientLoader to enable a DataSource V2 catalog backed by Hive. It 
works well through the API and spark-shell, but it fails on the spark-sql CLI.

After digging into the source, I found that SparkSQLCLIDriver (spark-sql) 
initializes differently: it uses a CliSessionState that is reused for the whole 
CLI lifecycle.

As a result, the IsolatedClientLoader creation in HiveUtils decides to turn 
isolation off when it encounters a global SessionState of that type. In my 
case, namespaces and tables from another Hive catalog are not recognized, 
because the CliSessionState in the SparkSession is always the one used for the 
connection.

I noticed [SPARK-21428|https://issues.apache.org/jira/browse/SPARK-21428], but 
since the DataSource V2 API is becoming more popular, shouldn't 
SparkSQLCLIDriver be adjusted as well?

My environment:

spark-3.1.2
hadoop-cdh5.13.0
hive-2.3.6
For each V2 catalog, spark.sql.hive.metastore.jars=builtin is set (we have no 
permission to deploy jars on the target clusters).

As a workaround, we currently have to deploy the jars on HDFS and use the 
'path' option, which causes a significant delay in catalog initialization.

Any help is appreciated, thanks.
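
For reference, a minimal sketch of the 'path' workaround mentioned above; the 
catalog name, plugin class, and HDFS location are placeholders, not taken from 
this report:

{code}
# Hypothetical sketch of the 'path' workaround described above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Placeholder V2 catalog plugin pointing at another Hive metastore.
    .config("spark.sql.catalog.other_hive", "com.example.MyHiveCatalogPlugin")
    # Load Hive metastore client jars from a remote path instead of 'builtin'.
    .config("spark.sql.hive.metastore.version", "2.3.6")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", "hdfs:///libs/hive-2.3.6/*")
    .getOrCreate()
)

spark.sql("SHOW NAMESPACES IN other_hive").show()
{code}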



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38588) Validate input dataset of ml.classification

2022-03-24 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-38588:
-
Fix Version/s: 3.4.0

> Validate input dataset of ml.classification
> ---
>
> Key: SPARK-38588
> URL: https://issues.apache.org/jira/browse/SPARK-38588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
> Fix For: 3.4.0
>
>
> LinearSVC should fail fast if the input dataset contains invalid values.
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38588) Validate input dataset of ml.classification

2022-03-24 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-38588.
--
Resolution: Resolved

> Validate input dataset of ml.classification
> ---
>
> Key: SPARK-38588
> URL: https://issues.apache.org/jira/browse/SPARK-38588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Major
>
> LinearSVC should fail fast if the input dataset contains invalid values.
>  
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
> val df = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(1.0, 
> Double.NaN)), LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 
> 2.0)))).toDF()
> val svc = new LinearSVC()
> val model = svc.fit(df)
> scala> model.intercept
> res0: Double = NaN
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2022-03-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-37568.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35951
[https://github.com/apache/spark/pull/35951]

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37568) Support 2-arguments by the convert_timezone() function

2022-03-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-37568:


Assignee: Max Gekk

> Support 2-arguments by the convert_timezone() function
> --
>
> Key: SPARK-37568
> URL: https://issues.apache.org/jira/browse/SPARK-37568
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> # If sourceTs is a timestamp_ntz, take the sourceTz from the session time 
> zone, see the SQL config spark.sql.session.timeZone
> # If sourceTs is a timestamp_ltz, convert it to a timestamp_ntz using the 
> targetTz



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38623) Add more comments and tests for HashShuffleSpec

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-38623:

Summary: Add more comments and tests for HashShuffleSpec  (was: Simplify 
the compatibility check in HashShuffleSpec)

> Add more comments and tests for HashShuffleSpec
> ---
>
> Key: SPARK-38623
> URL: https://issues.apache.org/jira/browse/SPARK-38623
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38063) Support SQL split_part function

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38063.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35352
[https://github.com/apache/spark/pull/35352]

> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.3.0
>
>
> `split_part()` is a function commonly supported by other systems such as 
> Postgres. The Spark equivalent is `element_at(split(arg, delim), part)`.
> h5. Function Specification
> h6. Syntax
> {code:java}
> split_part(str, delimiter, partNum)
> {code}
> h6. Arguments
> {code:java}
> str: string type
> delimiter: string type
> partNum: Integer type
> {code}
> h6. Note
> {code:java}
> 1. This function splits `str` by `delimiter` and returns the requested part of 
> the split (1-based).
> 2. If any input parameter is NULL, returns NULL.
> 3. If the index is out of range of the split parts, returns an empty string.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the end of 
> the string.
> 6. When the delimiter is empty, `str` is considered not split, so there is just 
> one split part.
> {code}
> h6. Examples
> {code:java}
> > SELECT _FUNC_('11.12.13', '.', 3);
> 13
> > SELECT _FUNC_(NULL, '.', 3);
> NULL
> > SELECT _FUNC_('11.12.13', '', 1);
> '11.12.13'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38063) Support SQL split_part function

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38063:
---

Assignee: Rui Wang

> Support SQL split_part function
> ---
>
> Key: SPARK-38063
> URL: https://issues.apache.org/jira/browse/SPARK-38063
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>
> `split_part()` is a function commonly supported by other systems such as 
> Postgres. The Spark equivalent is `element_at(split(arg, delim), part)`.
> h5. Function Specification
> h6. Syntax
> {code:java}
> split_part(str, delimiter, partNum)
> {code}
> h6. Arguments
> {code:java}
> str: string type
> delimiter: string type
> partNum: Integer type
> {code}
> h6. Note
> {code:java}
> 1. This function splits `str` by `delimiter` and return requested part of the 
> split (1-based). 
> 2. If any input parameter is NULL, return NULL.
> 3. If the index is out of range of split parts, returns empty stirng.
> 4. If `partNum` is 0, throws an error.
> 5. If `partNum` is negative, the parts are counted backward from the end of 
> the string
> 6. when delimiter is empty, str is considered not split thus there is just 1 
> split part. 
> {code}
> h6. Examples
> {code:java}
> > SELECT _FUNC_('11.12.13', '.', 3);
> 13
> > SELECT _FUNC_(NULL, '.', 3);
> NULL
> > SELECT _FUNC_('11.12.13', '', 1);
> '11.12.13'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38585) Simplify the code of TreeNode.clone()

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38585:
---

Assignee: Yang Jie

> Simplify the code of TreeNode.clone()
> -
>
> Key: SPARK-38585
> URL: https://issues.apache.org/jira/browse/SPARK-38585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> SPARK-28057 added {{forceCopy}} to the private {{mapChildren}} method in 
> {{TreeNode}} to implement the {{clone()}} method.
> After SPARK-34989, the call corresponding to {{forceCopy=false}} was changed 
> to use {{withNewChildren}}, so {{forceCopy}} is always true and the private 
> {{mapChildren}} is now only used by the {{clone()}} method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38585) Simplify the code of TreeNode.clone()

2022-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38585.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35890
[https://github.com/apache/spark/pull/35890]

> Simplify the code of TreeNode.clone()
> -
>
> Key: SPARK-38585
> URL: https://issues.apache.org/jira/browse/SPARK-38585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> SPARK-28057 added {{forceCopy}} to the private {{mapChildren}} method in 
> {{TreeNode}} to implement the {{clone()}} method.
> After SPARK-34989, the call corresponding to {{forceCopy=false}} was changed 
> to use {{withNewChildren}}, so {{forceCopy}} is always true and the private 
> {{mapChildren}} is now only used by the {{clone()}} method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org