[jira] [Commented] (SPARK-40173) Make pyspark.taskcontext examples self-contained

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583363#comment-17583363
 ] 

Apache Spark commented on SPARK-40173:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37623

> Make pyspark.taskcontext examples self-contained
> 
>
> Key: SPARK-40173
> URL: https://issues.apache.org/jira/browse/SPARK-40173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40173) Make pyspark.taskcontext examples self-contained

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40173:


Assignee: (was: Apache Spark)

> Make pyspark.taskcontext examples self-contained
> 
>
> Key: SPARK-40173
> URL: https://issues.apache.org/jira/browse/SPARK-40173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40173) Make pyspark.taskcontext examples self-contained

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583361#comment-17583361
 ] 

Apache Spark commented on SPARK-40173:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37623

> Make pyspark.taskcontext examples self-contained
> 
>
> Key: SPARK-40173
> URL: https://issues.apache.org/jira/browse/SPARK-40173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40173) Make pyspark.taskcontext examples self-contained

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40173:


Assignee: Apache Spark

> Make pyspark.taskcontext examples self-contained
> 
>
> Key: SPARK-40173
> URL: https://issues.apache.org/jira/browse/SPARK-40173
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark, Spark Core
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40187) Add doc for using Apache YuniKorn as a customized scheduler

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583352#comment-17583352
 ] 

Apache Spark commented on SPARK-40187:
--

User 'yangwwei' has created a pull request for this issue:
https://github.com/apache/spark/pull/37622

> Add doc for using Apache YuniKorn as a customized scheduler
> ---
>
> Key: SPARK-40187
> URL: https://issues.apache.org/jira/browse/SPARK-40187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Weiwei Yang
>Priority: Major
>
> Add a section under 
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes
>  to explain how to run Spark with Apache YuniKorn. This is based on [this PR 
> review 
> comment|https://github.com/apache/spark/pull/35663#issuecomment-1220474012].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40187) Add doc for using Apache YuniKorn as a customized scheduler

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40187:


Assignee: (was: Apache Spark)

> Add doc for using Apache YuniKorn as a customized scheduler
> ---
>
> Key: SPARK-40187
> URL: https://issues.apache.org/jira/browse/SPARK-40187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Weiwei Yang
>Priority: Major
>
> Add a section under 
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes
>  to explain how to run Spark with Apache YuniKorn. This is based on [this PR 
> review 
> comment|https://github.com/apache/spark/pull/35663#issuecomment-1220474012].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40187) Add doc for using Apache YuniKorn as a customized scheduler

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40187:


Assignee: Apache Spark

> Add doc for using Apache YuniKorn as a customized scheduler
> ---
>
> Key: SPARK-40187
> URL: https://issues.apache.org/jira/browse/SPARK-40187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Weiwei Yang
>Assignee: Apache Spark
>Priority: Major
>
> Add a section under 
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes
>  to explain how to run Spark with Apache YuniKorn. This is based on [this PR 
> review 
> comment|https://github.com/apache/spark/pull/35663#issuecomment-1220474012].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40187) Add doc for using Apache YuniKorn as a customized scheduler

2022-08-22 Thread Weiwei Yang (Jira)
Weiwei Yang created SPARK-40187:
---

 Summary: Add doc for using Apache YuniKorn as a customized 
scheduler
 Key: SPARK-40187
 URL: https://issues.apache.org/jira/browse/SPARK-40187
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.3.0
Reporter: Weiwei Yang


Add a section under 
https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes
 to explain how to run Spark with Apache YuniKorn. This is based on [this PR 
review 
comment|https://github.com/apache/spark/pull/35663#issuecomment-1220474012].
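As a rough sketch of the kind of configuration such a section might describe (not the content of the linked PR): since Spark 3.3 the generic config spark.kubernetes.scheduler.name selects a custom Kubernetes scheduler, and a YuniKorn setup would point it at the yunikorn scheduler. The queue label keys below are hypothetical examples, not the documented settings.

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch only: route driver and executor pods to a custom
// Kubernetes scheduler named "yunikorn".
val spark = SparkSession.builder()
  .appName("yunikorn-example")
  .config("spark.kubernetes.scheduler.name", "yunikorn")
  // hypothetical queue placement labels; the real keys belong in the new doc section
  .config("spark.kubernetes.driver.label.queue", "root.default")
  .config("spark.kubernetes.executor.label.queue", "root.default")
  .getOrCreate()
{code}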



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40186) mergedShuffleCleaner should have been shutdown before db closed

2022-08-22 Thread Yang Jie (Jira)
Yang Jie created SPARK-40186:


 Summary: mergedShuffleCleaner should have been shutdown before db 
closed
 Key: SPARK-40186
 URL: https://issues.apache.org/jira/browse/SPARK-40186
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Yang Jie


We should ensure that `RemoteBlockPushResolver#mergedShuffleCleaner` has been shut 
down before `RemoteBlockPushResolver#db` is closed; otherwise 
`RemoteBlockPushResolver#applicationRemoved` may perform delete operations on a 
closed db.

 

https://github.com/apache/spark/pull/37610#discussion_r951185256
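For illustration only, a minimal Scala sketch of the intended ordering (the real change is in the Java class RemoteBlockPushResolver; the parameter names below are hypothetical): the cleaner thread pool is stopped and drained before the DB handle is closed.

{code:scala}
import java.util.concurrent.{ExecutorService, TimeUnit}

// Sketch of the shutdown ordering this issue asks for: no cleanup task may run
// against the DB once it has been closed.
def close(mergedShuffleCleaner: ExecutorService, db: AutoCloseable): Unit = {
  mergedShuffleCleaner.shutdown()                      // stop accepting new cleanup tasks
  if (!mergedShuffleCleaner.awaitTermination(10, TimeUnit.SECONDS)) {
    mergedShuffleCleaner.shutdownNow()                 // interrupt stragglers
  }
  db.close()                                           // safe: the cleaner can no longer touch the DB
}
{code}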

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values

2022-08-22 Thread Vivek Garg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583315#comment-17583315
 ] 

Vivek Garg commented on SPARK-22588:


We offer comprehensive [Splunk online 
training|https://www.igmguru.com/big-data/splunk-training/] that also covers a 
variety of administrative and support options.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
>  Issue Type: Question
>  Components: Deploy
>Affects Versions: 2.1.1
>Reporter: Saanvi Sharma
>Priority: Minor
>  Labels: dynamodb, spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>  ClientNum | Value_1 | Value_2 | Value_3 | Value_4
>  14        | A       | B       | C       | null
>  19        | X       | Y       | null    | null
>  21        | R       | null    | null    | null
> I want to load the data into a DynamoDB table with ClientNum as the key, following 
> "Analyze Your Data on Amazon DynamoDB with Apache Spark" and "Using Spark SQL for 
> ETL".
> Here is the code that I tried:
>   var jobConf = new JobConf(sc.hadoopConfiguration)
>   jobConf.set("dynamodb.servicename", "dynamodb")
>   jobConf.set("dynamodb.input.tableName", "table_name")   
>   jobConf.set("dynamodb.output.tableName", "table_name")   
>   jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
>   jobConf.set("dynamodb.regionid", "eu-west-1")
>   jobConf.set("dynamodb.throughput.read", "1")
>   jobConf.set("dynamodb.throughput.read.percent", "1")
>   jobConf.set("dynamodb.throughput.write", "1")
>   jobConf.set("dynamodb.throughput.write.percent", "1")
>   
>   jobConf.set("mapred.output.format.class", 
> "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
>   jobConf.set("mapred.input.format.class", 
> "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>   // Import the data
>   val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load(path)
> I performed a transformation to have an RDD that matches the types that the 
> DynamoDB custom output format knows how to write. The custom output format 
> expects a tuple containing the Text and DynamoDBItemWritable types.
> Create a new RDD with those types in it, in the following map call:
>   // Convert the dataframe to an RDD
>   val df_rdd = df.rdd
>   > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
> MapPartitionsRDD[10] at rdd at :41
>   
>   // Print the first row of the RDD
>   df_rdd.take(1)
>   > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>   var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
>   })
> This last call uses the job configuration that defines the EMR-DDB connector 
> to write out the new RDD you created in the expected format:
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
> fails with the following error:
> Caused by: java.lang.NullPointerException
> The null values caused the error; if I try with only ClientNum and Value_1 it 
> works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help !!
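A possible workaround, sketched against the names and imports used in the quoted snippet above (df, df_rdd, AttributeValue, DynamoDBItemWritable, Text) and not a verified answer: skip null cells instead of calling setS(null), since the NullPointerException comes from building AttributeValue objects out of null values.

{code:scala}
// Capture plain column names so the closure does not drag the DataFrame along.
val colNames = df.columns

val ddbInsertFormattedRDD = df_rdd.map { row =>
  val ddbMap = new java.util.HashMap[String, AttributeValue]()
  colNames.zipWithIndex.foreach { case (name, i) =>
    if (!row.isNullAt(i)) {                      // only write non-null attributes
      val av = new AttributeValue()
      if (name == "ClientNum") av.setN(row.get(i).toString)
      else av.setS(row.get(i).toString)
      ddbMap.put(name, av)
    }
  }
  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
}
{code}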



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40185:


Assignee: (was: Apache Spark)

> Remove column suggestion when the candidate list is empty for unresolved 
> column/attribute/map key
> -
>
> Key: SPARK-40185
> URL: https://issues.apache.org/jira/browse/SPARK-40185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> For an unresolved column, attribute, or map key, the error message might contain 
> suggestions from a candidate list. However, when the list is empty the error 
> message looks incomplete:
> `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
> be resolved. Did you mean one of the following? []`
> This issue is to show the final suggestion only if the suggestion list is 
> non-empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583299#comment-17583299
 ] 

Apache Spark commented on SPARK-40185:
--

User 'vitaliili-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37621

> Remove column suggestion when the candidate list is empty for unresolved 
> column/attribute/map key
> -
>
> Key: SPARK-40185
> URL: https://issues.apache.org/jira/browse/SPARK-40185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> For an unresolved column, attribute, or map key, the error message might contain 
> suggestions from a candidate list. However, when the list is empty the error 
> message looks incomplete:
> `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
> be resolved. Did you mean one of the following? []`
> This issue is to show the final suggestion only if the suggestion list is 
> non-empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583298#comment-17583298
 ] 

Apache Spark commented on SPARK-40185:
--

User 'vitaliili-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37621

> Remove column suggestion when the candidate list is empty for unresolved 
> column/attribute/map key
> -
>
> Key: SPARK-40185
> URL: https://issues.apache.org/jira/browse/SPARK-40185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Priority: Major
>
> For an unresolved column, attribute, or map key, the error message might contain 
> suggestions from a candidate list. However, when the list is empty the error 
> message looks incomplete:
> `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
> be resolved. Did you mean one of the following? []`
> This issue is to show the final suggestion only if the suggestion list is 
> non-empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40185:


Assignee: Apache Spark

> Remove column suggestion when the candidate list is empty for unresolved 
> column/attribute/map key
> -
>
> Key: SPARK-40185
> URL: https://issues.apache.org/jira/browse/SPARK-40185
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Vitalii Li
>Assignee: Apache Spark
>Priority: Major
>
> For an unresolved column, attribute, or map key, the error message might contain 
> suggestions from a candidate list. However, when the list is empty the error 
> message looks incomplete:
> `[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot 
> be resolved. Did you mean one of the following? []`
> This issue is to show the final suggestion only if the suggestion list is 
> non-empty.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40185) Remove column suggestion when the candidate list is empty for unresolved column/attribute/map key

2022-08-22 Thread Vitalii Li (Jira)
Vitalii Li created SPARK-40185:
--

 Summary: Remove column suggestion when the candidate list is empty 
for unresolved column/attribute/map key
 Key: SPARK-40185
 URL: https://issues.apache.org/jira/browse/SPARK-40185
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Vitalii Li


For an unresolved column, attribute, or map key, the error message might contain 
suggestions from a candidate list. However, when the list is empty the error 
message looks incomplete:

`[UNRESOLVED_COLUMN] A column or function parameter with name 'YrMo' cannot be 
resolved. Did you mean one of the following? []`

This issue is to show the final suggestion only if the suggestion list is 
non-empty.
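A minimal sketch of the intended behaviour (illustrative only, not the actual Spark error-framework code):

{code:scala}
// Append the "Did you mean ...?" hint only when there is at least one candidate.
def unresolvedColumnMessage(name: String, candidates: Seq[String]): String = {
  val base = s"[UNRESOLVED_COLUMN] A column or function parameter with name '$name' cannot be resolved."
  if (candidates.isEmpty) base
  else base + s" Did you mean one of the following? [${candidates.mkString(", ")}]"
}
{code}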



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40160) Make pyspark.broadcast examples self-contained

2022-08-22 Thread Qian Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583291#comment-17583291
 ] 

Qian Sun commented on SPARK-40160:
--

working on it :)

> Make pyspark.broadcast examples self-contained
> --
>
> Key: SPARK-40160
> URL: https://issues.apache.org/jira/browse/SPARK-40160
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40184) Support modify the comment of a partitioned column

2022-08-22 Thread melin (Jira)
melin created SPARK-40184:
-

 Summary: Support modify the comment of a partitioned column
 Key: SPARK-40184
 URL: https://issues.apache.org/jira/browse/SPARK-40184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: melin


A comment is not added to the partition column when the table is created. We 
should support modifying the comment of a partition column afterwards.
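For illustration, assuming an active SparkSession named spark: the ALTER COLUMN ... COMMENT statement below works for regular columns today; whether the same (or an equivalent) works for a partition column is exactly what this ticket asks for, so the second statement is the desired behaviour, not something Spark currently guarantees.

{code:scala}
spark.sql("CREATE TABLE events (id INT, dt STRING) USING parquet PARTITIONED BY (dt)")
// Desired to succeed for the partition column dt (sketch of the requested behaviour):
spark.sql("ALTER TABLE events ALTER COLUMN dt COMMENT 'partition date'")
{code}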



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40183) Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40183:


Assignee: Apache Spark  (was: Gengliang Wang)

> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion
> -
>
> Key: SPARK-40183
> URL: https://issues.apache.org/jira/browse/SPARK-40183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal 
> conversion, instead of the confusing error class 
> `CANNOT_CHANGE_DECIMAL_PRECISION`.
> Also, use `decimal.toPlainString` instead of `decimal.toDebugString` in the 
> error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40183) Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40183:


Assignee: Gengliang Wang  (was: Apache Spark)

> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion
> -
>
> Key: SPARK-40183
> URL: https://issues.apache.org/jira/browse/SPARK-40183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal 
> conversion, instead of the confusing error class 
> `CANNOT_CHANGE_DECIMAL_PRECISION`.
> Also, use `decimal.toPlainString` instead of `decimal.toDebugString` in the 
> error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40183) Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583280#comment-17583280
 ] 

Apache Spark commented on SPARK-40183:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37620

> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion
> -
>
> Key: SPARK-40183
> URL: https://issues.apache.org/jira/browse/SPARK-40183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal 
> conversion, instead of the confusing error class 
> `CANNOT_CHANGE_DECIMAL_PRECISION`.
> Also, use `decimal.toPlainString` instead of `decimal.toDebugString` in the 
> error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40183) Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion

2022-08-22 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-40183:
--

 Summary: Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow 
in decimal conversion
 Key: SPARK-40183
 URL: https://issues.apache.org/jira/browse/SPARK-40183
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Use error class NUMERIC_VALUE_OUT_OF_RANGE for overflow in decimal conversion, 
instead of the confusing error class `CANNOT_CHANGE_DECIMAL_PRECISION`.

Also, use `decimal.toPlainString` instead of `decimal.toDebugString` in the 
error message.
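A small illustration of the readability difference, using java.math.BigDecimal (which Spark's internal Decimal wraps); the exact output of Spark's Decimal.toDebugString is not reproduced here.

{code:scala}
val d = new java.math.BigDecimal("1E+10")
println(d.toString)        // 1E+10          (may use scientific notation)
println(d.toPlainString)   // 10000000000    (plain digits, easier to read in an error message)
{code}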



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39917) Use different error classes for numeric/interval arithmetic overflow

2022-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39917:
---
Parent: SPARK-40182
Issue Type: Sub-task  (was: Task)

> Use different error classes for numeric/interval arithmetic overflow
> 
>
> Key: SPARK-39917
> URL: https://issues.apache.org/jira/browse/SPARK-39917
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, when arithmetic overflow errors happen under ANSI mode, the error 
> messages are like:
> [ARITHMETIC_OVERFLOW] long overflow. Use 'try_multiply' to tolerate overflow 
> and return NULL instead. If necessary set spark.sql.ansi.enabled to "false" 
> (except for ANSI interval type) to bypass this error.
>  
> The "(except for ANSI interval type)" part is confusing. We should remove it 
> for the numeric arithmetic operations and have a new error class for the 
> interval division error: INTERVAL_ARITHMETIC_OVERFLOW.
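For context, a minimal sketch assuming an active SparkSession named spark: under ANSI mode the overflowing multiplication raises the error above, while the try_ variant returns NULL.

{code:scala}
spark.sql("SET spark.sql.ansi.enabled=true")
// spark.sql("SELECT 9223372036854775807L * 2L")                    // would raise ARITHMETIC_OVERFLOW
spark.sql("SELECT try_multiply(9223372036854775807L, 2L)").show()   // NULL instead of an error
{code}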



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39865) Show proper error messages on the overflow errors of table insert

2022-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39865:
---
Description: 
In Spark 3.3, the error message of ANSI CAST is improved. However, the table 
insertion is using the same CAST expression:


{code:java}
> create table tiny(i tinyint);
> insert into tiny values (1000);
org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of the 
type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` to 
tolerate overflow and return NULL instead. If necessary set 
"spark.sql.ansi.enabled" to "false" to bypass this error.
{code}
 

Showing the hint `If necessary set "spark.sql.ansi.enabled" to "false" to 
bypass this error` doesn't help at all. This issue is to fix the error message. 
After the changes, the error message for this example will become:
{code:java}
org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] Fail 
to insert a value of "INT" type into the "TINYINT" type column `i` due to an 
overflow. Use `try_cast` on the input value to tolerate overflow and return 
NULL instead.{code}

> Show proper error messages on the overflow errors of table insert
> -
>
> Key: SPARK-39865
> URL: https://issues.apache.org/jira/browse/SPARK-39865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.1
>
>
> In Spark 3.3, the error message of ANSI CAST is improved. However, the table 
> insertion is using the same CAST expression:
> {code:java}
> > create table tiny(i tinyint);
> > insert into tiny values (1000);
> org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of 
> the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` 
> to tolerate overflow and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
>  
> Showing the hint `If necessary set "spark.sql.ansi.enabled" to "false" to 
> bypass this error` doesn't help at all. This issue is to fix the error message. 
> After the changes, the error message for this example will become:
> {code:java}
> org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] 
> Fail to insert a value of "INT" type into the "TINYINT" type column `i` due 
> to an overflow. Use `try_cast` on the input value to tolerate overflow and 
> return NULL instead.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39865) Show proper error messages on the overflow errors of table insert

2022-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39865:
---
Parent: SPARK-40182
Issue Type: Sub-task  (was: Bug)

> Show proper error messages on the overflow errors of table insert
> -
>
> Key: SPARK-39865
> URL: https://issues.apache.org/jira/browse/SPARK-39865
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.3.1
>
>
> In Spark 3.3, the error message of ANSI CAST is improved. However, the table 
> insertion is using the same CAST expression:
> {code:java}
> > create table tiny(i tinyint);
> > insert into tiny values (1000);
> org.apache.spark.SparkArithmeticException[CAST_OVERFLOW]: The value 1000 of 
> the type "INT" cannot be cast to "TINYINT" due to an overflow. Use `try_cast` 
> to tolerate overflow and return NULL instead. If necessary set 
> "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
>  
> Showing the hint `If necessary set "spark.sql.ansi.enabled" to "false" to 
> bypass this error` doesn't help at all. This issue is to fix the error message. 
> After the changes, the error message for this example will become:
> {code:java}
> org.apache.spark.SparkArithmeticException: [CAST_OVERFLOW_IN_TABLE_INSERT] 
> Fail to insert a value of "INT" type into the "TINYINT" type column `i` due 
> to an overflow. Use `try_cast` on the input value to tolerate overflow and 
> return NULL instead.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39889) Use different error classes for numeric/interval divided by 0

2022-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-39889:
---
Parent: SPARK-40182
Issue Type: Sub-task  (was: Task)

> Use different error classes for numeric/interval divided by 0
> -
>
> Key: SPARK-39889
> URL: https://issues.apache.org/jira/browse/SPARK-39889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, when numbers are divided by 0 under ANSI mode, the error message 
> is like
> {quote}[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
> divisor being 0 and return NULL instead. If necessary set "ansi_mode" to 
> "false" (except for ANSI interval type) to bypass this error.{quote}
> The "(except for ANSI interval type)" part is confusing.  We should remove it 
> and have a new error class "INTERVAL_DIVIDED_BY_ZERO"
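Likewise, a minimal sketch assuming an active SparkSession named spark: try_divide tolerates a zero divisor, which is the behaviour the hint points users to, while interval division by zero would get the proposed new error class.

{code:scala}
spark.sql("SET spark.sql.ansi.enabled=true")
// spark.sql("SELECT 1 / 0")                     // would raise DIVIDE_BY_ZERO under ANSI mode
spark.sql("SELECT try_divide(1, 0)").show()      // returns NULL instead
{code}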



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40182) Improve ANSI runtime error messages

2022-08-22 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-40182:
--

 Summary: Improve ANSI runtime error messages
 Key: SPARK-40182
 URL: https://issues.apache.org/jira/browse/SPARK-40182
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang


Improve the runtime error messages related to the ANSI SQL mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread BingKun Pan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583274#comment-17583274
 ] 

BingKun Pan commented on SPARK-40165:
-

I will investigate the root cause for the failure carefully.

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>
> Include:
>  * scalacheck (from 1.15.4 to 1.16.0)
>  * maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * maven-dependency-plugin (from 3.1.1 to 3.3.0)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40081) Add Document Parameters for pyspark.sql.streaming.query

2022-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40081:


Assignee: Qian Sun

> Add Document Parameters for pyspark.sql.streaming.query
> ---
>
> Key: SPARK-40081
> URL: https://issues.apache.org/jira/browse/SPARK-40081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40081) Add Document Parameters for pyspark.sql.streaming.query

2022-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40081.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37587
[https://github.com/apache/spark/pull/37587]

> Add Document Parameters for pyspark.sql.streaming.query
> ---
>
> Key: SPARK-40081
> URL: https://issues.apache.org/jira/browse/SPARK-40081
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.4.0
>Reporter: Qian Sun
>Assignee: Qian Sun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40142:


Assignee: Hyukjin Kwon

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40142.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37592
[https://github.com/apache/spark/pull/37592]

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-40088.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37619
[https://github.com/apache/spark/pull/37619]

> Add SparkPlanWIthAQESuite
> -
>
> Key: SPARK-40088
> URL: https://issues.apache.org/jira/browse/SPARK-40088
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
> also test with AQE turned on
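An illustrative sketch only (not the contents of the proposed SparkPlanWithAQESuite): the same assertions can be re-run with adaptive query execution enabled through the standard SQL conf.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: build a session with AQE explicitly enabled and re-run the plan checks.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("plan-suite-with-aqe")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
// ... run the existing SparkPlanSuite assertions against this session ...
{code}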



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-40088:
---

Assignee: Kazuyuki Tanimura

> Add SparkPlanWIthAQESuite
> -
>
> Key: SPARK-40088
> URL: https://issues.apache.org/jira/browse/SPARK-40088
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
> also test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40181) DataFrame.intersect and .intersectAll are inconsistently dropping rows

2022-08-22 Thread Luke (Jira)
Luke created SPARK-40181:


 Summary: DataFrame.intersect and .intersectAll are inconsistently 
dropping rows
 Key: SPARK-40181
 URL: https://issues.apache.org/jira/browse/SPARK-40181
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Luke


I don't have a minimal reproducible example for this, but the place where it 
shows up in our workflow is very simple.

The data in "COLUMN" are a few hundred million distinct strings (gets 
deduplicated in the plan also) and it is being compared against itself using 
intersect.

The code that is failing is essentially:
{quote}values = [...] # python list containing many unique strings, none of 
which are None

df = spark.createDataFrame(
    spark.sparkContext.parallelize(
        [(value,) for value in values], numSlices=2 + len(values) // 1
    ),
    schema=StructType([StructField("COLUMN", StringType())]),
)

df = df.distinct()

assert df.count() == df.intersect(df).count()

assert df.count() == df.intersectAll(df).count()
{quote}
The issue is that both of the above asserts sometimes pass, and sometimes fail 
(technically we haven't seen intersectAll pass yet, but we have only tried a 
few times). One thing which is striking is that if you call 
df.intersect(df).count() multiple times, the returned count is not always the 
same. Sometimes it is exactly df.count(), sometimes it is ~1% lower, but how 
much lower exactly seems random.

In particular, we have called df.intersect(df).count() twice in a row, and got 
two different counts, which is very surprising given that df should be 
deterministic, and suggests maybe there is some kind of 
concurrency/inconsistent hashing issue?

One other thing which is possibly noteworthy is that using df.join(df, 
df.columns, how="inner") does seem to reliably have the desired behavior (not 
dropping any rows).

Here is the resulting plan from df.intersect(df)
{quote}== Parsed Logical Plan ==
'Intersect false
:- Deduplicate [COLUMN#144487]
:  +- LogicalRDD [COLUMN#144487], false
+- Deduplicate [COLUMN#144487]
   +- LogicalRDD [COLUMN#144487], false

== Analyzed Logical Plan ==
COLUMN: string
Intersect false
:- Deduplicate [COLUMN#144487]
:  +- LogicalRDD [COLUMN#144487], false
+- Deduplicate [COLUMN#144523]
   +- LogicalRDD [COLUMN#144523], false

== Optimized Logical Plan ==
Aggregate [COLUMN#144487], [COLUMN#144487]
+- Join LeftSemi, (COLUMN#144487 <=> COLUMN#144523)
   :- LogicalRDD [COLUMN#144487], false
   +- Aggregate [COLUMN#144523], [COLUMN#144523]
      +- LogicalRDD [COLUMN#144523], false

== Physical Plan ==
*(7) HashAggregate(keys=[COLUMN#144487], functions=[], output=[COLUMN#144487])
+- Exchange hashpartitioning(COLUMN#144487, 200), true, [id=#22790]
   +- *(6) HashAggregate(keys=[COLUMN#144487], functions=[], 
output=[COLUMN#144487])
      +- *(6) SortMergeJoin [coalesce(COLUMN#144487, ), isnull(COLUMN#144487)], 
[coalesce(COLUMN#144523, ), isnull(COLUMN#144523)], LeftSemi
         :- *(2) Sort [coalesce(COLUMN#144487, ) ASC NULLS FIRST, 
isnull(COLUMN#144487) ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(coalesce(COLUMN#144487, ), 
isnull(COLUMN#144487), 200), true, [id=#22772]
         :     +- *(1) Scan ExistingRDD[COLUMN#144487]
         +- *(5) Sort [coalesce(COLUMN#144523, ) ASC NULLS FIRST, 
isnull(COLUMN#144523) ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(coalesce(COLUMN#144523, ), 
isnull(COLUMN#144523), 200), true, [id=#22782]
               +- *(4) HashAggregate(keys=[COLUMN#144523], functions=[], 
output=[COLUMN#144523])
                  +- Exchange hashpartitioning(COLUMN#144523, 200), true, 
[id=#22778]
                     +- *(3) HashAggregate(keys=[COLUMN#144523], functions=[], 
output=[COLUMN#144523])
                        +- *(3) Scan ExistingRDD[COLUMN#144523]
{quote}
and for df.intersectAll(df)
{quote}== Parsed Logical Plan ==
'IntersectAll true
:- Deduplicate [COLUMN#144487]
:  +- LogicalRDD [COLUMN#144487], false
+- Deduplicate [COLUMN#144487]
   +- LogicalRDD [COLUMN#144487], false

== Analyzed Logical Plan ==
COLUMN: string
IntersectAll true
:- Deduplicate [COLUMN#144487]
:  +- LogicalRDD [COLUMN#144487], false
+- Deduplicate [COLUMN#144533]
   +- LogicalRDD [COLUMN#144533], false

== Optimized Logical Plan ==
Project [COLUMN#144487]
+- Generate replicaterows(min_count#144566L, COLUMN#144487), [1], false, 
[COLUMN#144487]
   +- Project [COLUMN#144487, if ((vcol1_count#144563L > vcol2_count#144565L)) 
vcol2_count#144565L else vcol1_count#144563L AS min_count#144566L]
      +- Filter ((vcol1_count#144563L >= 1) AND (vcol2_count#144565L >= 1))
         +- Aggregate [COLUMN#144487], [count(vcol1#144558) AS 
vcol1_count#144563L, count(vcol2#144561) AS vcol2_count#144565L, COLUMN#144487]
            +- Union
               :- Aggregate [COLUMN#144487], [true AS vcol1#144558, null AS 

[jira] [Assigned] (SPARK-40180) Format error messages by spark-sql

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40180:


Assignee: Max Gekk  (was: Apache Spark)

> Format error messages by spark-sql
> --
>
> Key: SPARK-40180
> URL: https://issues.apache.org/jira/browse/SPARK-40180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Respect the SQL config spark.sql.error.messageFormat in the implementation of 
> the SQL CLI: spark-sql.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40180) Format error messages by spark-sql

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583143#comment-17583143
 ] 

Apache Spark commented on SPARK-40180:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37590

> Format error messages by spark-sql
> --
>
> Key: SPARK-40180
> URL: https://issues.apache.org/jira/browse/SPARK-40180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Respect the SQL config spark.sql.error.messageFormat in the implementation of 
> the SQL CLI: spark-sql.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40180) Format error messages by spark-sql

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40180:


Assignee: Apache Spark  (was: Max Gekk)

> Format error messages by spark-sql
> --
>
> Key: SPARK-40180
> URL: https://issues.apache.org/jira/browse/SPARK-40180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Respect the SQL config spark.sql.error.messageFormat in the implementation of 
> the SQL CLI: spark-sql.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40180) Format error messages by spark-sql

2022-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40180:
-
Description: Respect the SQL config spark.sql.error.messageFormat in the 
implementation of the SQL CLI: spark-sql.  (was: # Introduce a config to 
control the format of error messages: plain text and JSON
# Modify the Thrift Server to output errors from Spark SQL according to the 
config)

> Format error messages by spark-sql
> --
>
> Key: SPARK-40180
> URL: https://issues.apache.org/jira/browse/SPARK-40180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Respect the SQL config spark.sql.error.messageFormat in the implementation of 
> the SQL CLI: spark-sql.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40180) Format error messages by spark-sql

2022-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40180:
-
Fix Version/s: (was: 3.4.0)

> Format error messages by spark-sql
> --
>
> Key: SPARK-40180
> URL: https://issues.apache.org/jira/browse/SPARK-40180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> # Introduce a config to control the format of error messages: plain text and 
> JSON
> # Modify the Thrift Server to output errors from Spark SQL according to the 
> config



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40180) Format error messages by spark-sql

2022-08-22 Thread Max Gekk (Jira)
Max Gekk created SPARK-40180:


 Summary: Format error messages by spark-sql
 Key: SPARK-40180
 URL: https://issues.apache.org/jira/browse/SPARK-40180
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 3.4.0


# Introduce a config to control the format of error messages: plain text and 
JSON
# Modify the Thrift Server to output errors from Spark SQL according to the 
config



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40166) Add array_sort(column, comparator) to PySpark

2022-08-22 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-40166.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37600
[https://github.com/apache/spark/pull/37600]

> Add array_sort(column, comparator) to PySpark
> -
>
> Key: SPARK-40166
> URL: https://issues.apache.org/jira/browse/SPARK-40166
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.4.0
>
>
> SPARK-39925 exposed array_sort(column, comparator) on JVM. It should be 
> available in Python as well.
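For reference, a sketch of the JVM API this ticket exposes to Python, assuming the Scala overload added by SPARK-39925, array_sort(e: Column, comparator: (Column, Column) => Column), and an active SparkSession:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("array-sort-demo").getOrCreate()
import spark.implicits._

// Sort descending by returning -1/0/1 from the comparator as a Column.
val df = Seq(Seq(3, 1, 2)).toDF("xs")
df.select(array_sort($"xs", (l, r) => when(l < r, 1).when(l > r, -1).otherwise(0)).as("sorted"))
  .show()   // expected: [3, 2, 1]
{code}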



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40166) Add array_sort(column, comparator) to PySpark

2022-08-22 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-40166:
--

Assignee: Maciej Szymkiewicz

> Add array_sort(column, comparator) to PySpark
> -
>
> Key: SPARK-40166
> URL: https://issues.apache.org/jira/browse/SPARK-40166
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> SPARK-39925 exposed array_sort(column, comparator) on JVM. It should be 
> available in Python as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40167) Add array_sort(column, comparator) to SparkR

2022-08-22 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-40167.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37600
[https://github.com/apache/spark/pull/37600]

> Add array_sort(column, comparator) to SparkR
> 
>
> Key: SPARK-40167
> URL: https://issues.apache.org/jira/browse/SPARK-40167
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.4.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.4.0
>
>
> SPARK-39925 exposed array_sort(column, comparator) on JVM. It should be 
> available in R as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40167) Add array_sort(column, comparator) to SparkR

2022-08-22 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-40167:
--

Assignee: Maciej Szymkiewicz

> Add array_sort(column, comparator) to SparkR
> 
>
> Key: SPARK-40167
> URL: https://issues.apache.org/jira/browse/SPARK-40167
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.4.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> SPARK-39925 exposed array_sort(column, comparator) on JVM. It should be 
> available in R as well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583112#comment-17583112
 ] 

Apache Spark commented on SPARK-40088:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37619

> Add SparkPlanWIthAQESuite
> -
>
> Key: SPARK-40088
> URL: https://issues.apache.org/jira/browse/SPARK-40088
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
> also test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40088:


Assignee: Apache Spark

> Add SparkPlanWIthAQESuite
> -
>
> Key: SPARK-40088
> URL: https://issues.apache.org/jira/browse/SPARK-40088
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Assignee: Apache Spark
>Priority: Minor
>
> Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
> also test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583111#comment-17583111
 ] 

Apache Spark commented on SPARK-40088:
--

User 'kazuyukitanimura' has created a pull request for this issue:
https://github.com/apache/spark/pull/37619

> Add SparkPlanWIthAQESuite
> -
>
> Key: SPARK-40088
> URL: https://issues.apache.org/jira/browse/SPARK-40088
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
> also test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40088:


Assignee: (was: Apache Spark)

> Add SparkPlanWIthAQESuite
> -
>
> Key: SPARK-40088
> URL: https://issues.apache.org/jira/browse/SPARK-40088
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
> also test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40165:
--
Fix Version/s: (was: 3.4.0)

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583079#comment-17583079
 ] 

Dongjoon Hyun commented on SPARK-40165:
---

This is reverted via 
https://github.com/apache/spark/commit/b6192126351ea2ae658e2f0cfd8c57baf3f1d900

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40165:


Assignee: (was: Apache Spark)

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-40165:
---
  Assignee: (was: BingKun Pan)

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40165:


Assignee: Apache Spark

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40165) Update test plugins to latest versions

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17583024#comment-17583024
 ] 

Apache Spark commented on SPARK-40165:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37618

> Update test plugins to latest versions
> --
>
> Key: SPARK-40165
> URL: https://issues.apache.org/jira/browse/SPARK-40165
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Trivial
> Fix For: 3.4.0
>
>
> Include:
>  * 1.scalacheck (from 1.15.4 to 1.16.0)
>  * 2.maven-surefire-plugin (from 3.0.0-M5 to 3.0.0-M7)
>  * 3.maven-dependency-plugin (from 3.1.1 to 3.3.0)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39755) Improve LocalDirsFeatureStep to randomize local directories

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39755:
--
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Improve LocalDirsFeatureStep to randomize local directories
> ---
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.4.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(",").
> This is the branch used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side. 
>  
>  
>  
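A sketch of the change the description asks for, i.e. wrapping the SPARK_LOCAL_DIRS branch of Utils.getConfiguredLocalDirs in the same randomizeInPlace call the YARN branch already uses (not necessarily the exact diff that was merged in the PR):

{code:scala}
// Randomize the user-provided SPARK_LOCAL_DIRS so shuffle data spreads evenly
// across all configured directories, as the YARN branch already does.
} else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
  randomizeInPlace(conf.getenv("SPARK_LOCAL_DIRS").split(","))
}
{code}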



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39755) Improve LocalDirsFeatureStep to randomize local directories

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39755:
--
Component/s: (was: Spark Core)

> Improve LocalDirsFeatureStep to randomize local directories
> ---
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(",").
> This is the branch used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39755) Improve LocalDirsFeatureStep to randomize local directories

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39755:
--
Summary: Improve LocalDirsFeatureStep to randomize local directories  (was: 
SPARK_LOCAL_DIRS locations are not randomized in K8s)

> Improve LocalDirsFeatureStep to randomize local directories
> ---
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(",").
> This is the branch used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39755:
--
Issue Type: Improvement  (was: Bug)

> SPARK_LOCAL_DIRS locations are not randomized in K8s
> 
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(",").
> This is the branch used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39755.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37203
[https://github.com/apache/spark/pull/37203]

> SPARK_LOCAL_DIRS locations are not randomized in K8s
> 
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
> Fix For: 3.4.0
>
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(",").
> This is the branch used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39755) SPARK_LOCAL_DIRS locations are not randomized in K8s

2022-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39755:
-

Assignee: pralabhkumar

> SPARK_LOCAL_DIRS locations are not randomized in K8s
> 
>
> Key: SPARK-39755
> URL: https://issues.apache.org/jira/browse/SPARK-39755
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.3.0
>Reporter: pralabhkumar
>Assignee: pralabhkumar
>Priority: Minor
>
> In org.apache.spark.util  getConfiguredLocalDirs  
>  
> {code:java}
> if (isRunningInYarnContainer(conf)) {
>   // If we are in yarn mode, systems can have different disk layouts so we 
> must set it
>   // to what Yarn on this system said was available. Note this assumes that 
> Yarn has
>   // created the directories already, and that they are secured so that only 
> the
>   // user has access to them.
>   randomizeInPlace(getYarnLocalDirs(conf).split(","))
> } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
>   conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
> } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
>   conf.getenv("SPARK_LOCAL_DIRS").split(",")
> }{code}
> randomizeInPlace is not called on conf.getenv("SPARK_LOCAL_DIRS").split(",").
> This is the branch used in the K8s case, so the shuffle locations are not 
> randomized. 
> IMHO, this should be randomized so that all the directories have an equal 
> chance of receiving data, as is done on the YARN side. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40179) Run / Scala 2.13 build with SBT GA failed

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40179:


Assignee: Apache Spark

> Run / Scala 2.13 build with SBT GA failed
> -
>
> Key: SPARK-40179
> URL: https://issues.apache.org/jira/browse/SPARK-40179
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> [error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:36:1:
>   error: package org.apache.http.protocol does not exist
> 1011[error] import org.apache.http.protocol.BasicHttpContext;
> 1012[error]^
> 1013[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:156:1:
>   error: cannot find symbol
> 1014[error] private final HttpContext httpContext;
> 1015[error]   ^  symbol:   class HttpContext
> 1016[error]   location: class HttpKerberosClientAction
> 1017[error] 3 errors {code}
>  
>  * [https://github.com/apache/spark/runs/7947684467?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7947300886?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946453241?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946444061?check_suite_focus=true]
>  
> But a local run of
>  
> {code:java}
> ./dev/change-scala-version.sh 2.13
> ./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver 
> -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests 
> -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile 
> Test/compile
>  {code}
> can pass. Maybe a cache file is corrupt?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40179) Run / Scala 2.13 build with SBT GA failed

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40179:


Assignee: (was: Apache Spark)

> Run / Scala 2.13 build with SBT GA failed
> -
>
> Key: SPARK-40179
> URL: https://issues.apache.org/jira/browse/SPARK-40179
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
> [error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:36:1:
>   error: package org.apache.http.protocol does not exist
> 1011[error] import org.apache.http.protocol.BasicHttpContext;
> 1012[error]^
> 1013[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:156:1:
>   error: cannot find symbol
> 1014[error] private final HttpContext httpContext;
> 1015[error]   ^  symbol:   class HttpContext
> 1016[error]   location: class HttpKerberosClientAction
> 1017[error] 3 errors {code}
>  
>  * [https://github.com/apache/spark/runs/7947684467?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7947300886?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946453241?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946444061?check_suite_focus=true]
>  
> But a local run of
>  
> {code:java}
> ./dev/change-scala-version.sh 2.13
> ./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver 
> -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests 
> -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile 
> Test/compile
>  {code}
> can pass. Maybe a cache file is corrupt?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40179) Run / Scala 2.13 build with SBT GA failed

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40179:


Assignee: Apache Spark

> Run / Scala 2.13 build with SBT GA failed
> -
>
> Key: SPARK-40179
> URL: https://issues.apache.org/jira/browse/SPARK-40179
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> [error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:36:1:
>   error: package org.apache.http.protocol does not exist
> 1011[error] import org.apache.http.protocol.BasicHttpContext;
> 1012[error]^
> 1013[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:156:1:
>   error: cannot find symbol
> 1014[error] private final HttpContext httpContext;
> 1015[error]   ^  symbol:   class HttpContext
> 1016[error]   location: class HttpKerberosClientAction
> 1017[error] 3 errors {code}
>  
>  * [https://github.com/apache/spark/runs/7947684467?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7947300886?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946453241?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946444061?check_suite_focus=true]
>  
> But a local run of
>  
> {code:java}
> ./dev/change-scala-version.sh 2.13
> ./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver 
> -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests 
> -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile 
> Test/compile
>  {code}
> can pass. Maybe a cache file is corrupt?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40179) Run / Scala 2.13 build with SBT GA failed

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582951#comment-17582951
 ] 

Apache Spark commented on SPARK-40179:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37617

> Run / Scala 2.13 build with SBT GA failed
> -
>
> Key: SPARK-40179
> URL: https://issues.apache.org/jira/browse/SPARK-40179
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
> [error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:36:1:
>   error: package org.apache.http.protocol does not exist
> 1011[error] import org.apache.http.protocol.BasicHttpContext;
> 1012[error]^
> 1013[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:156:1:
>   error: cannot find symbol
> 1014[error] private final HttpContext httpContext;
> 1015[error]   ^  symbol:   class HttpContext
> 1016[error]   location: class HttpKerberosClientAction
> 1017[error] 3 errors {code}
>  
>  * [https://github.com/apache/spark/runs/7947684467?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7947300886?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946453241?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946444061?check_suite_focus=true]
>  
> But a local run of
>  
> {code:java}
> ./dev/change-scala-version.sh 2.13
> ./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver 
> -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests 
> -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile 
> Test/compile
>  {code}
> can pass. Maybe a cache file is corrupt?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40170) StringCoding UTF8 decode slowly

2022-08-22 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582935#comment-17582935
 ] 

caican edited comment on SPARK-40170 at 8/22/22 12:13 PM:
--

[~kabhwan]

My program code is very simple, as shown below.

```

val rdd = spark.sql("select triggerId,adMetadata,userData from 
iceberg_my_cloud.mydb.myTable where date = 20220801").rdd
println(rdd.count())

```

In addition to the string decode, the conversion of Tuple2 to MAP is slow, and I 
have submitted a patch (https://github.com/apache/spark/pull/37609) to optimize 
it, but right now I don't have a good way to optimize the string decode.


was (Author: JIRAUSER280464):
My program code is very simple, as shown below.

```

val rdd = spark.sql("select triggerId,adMetadata,userData from 
iceberg_my_cloud.mydb.myTable where date = 20220801").rdd
println(rdd.count())

```

> StringCoding UTF8 decode slowly
> ---
>
> Key: SPARK-40170
> URL: https://issues.apache.org/jira/browse/SPARK-40170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Priority: Major
> Attachments: image-2022-08-22-10-56-54-768.png, 
> image-2022-08-22-10-57-11-744.png
>
>
> When `UnsafeRow` is converted to `Row` in 
> `org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
> the UTF8String decoding and copyMemory steps are very slow.
> Does anyone have any ideas for optimization?
> !image-2022-08-22-10-56-54-768.png!
>  
> !image-2022-08-22-10-57-11-744.png!
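One hedged workaround for the reproduction in the comments above, assuming only the row count is needed: counting the DataFrame directly (instead of going through .rdd) never materializes external Row objects, so the createExternalRow / UTF8String decode path is skipped. A sketch, reusing the table and column names from the reporter's example and the existing `spark` session:

{code:scala}
// DataFrame.count() is computed on UnsafeRows, whereas .rdd.count() forces every
// row through SpecificSafeProjection.createExternalRow and decodes each string.
val cnt = spark.sql(
  "select triggerId, adMetadata, userData " +
  "from iceberg_my_cloud.mydb.myTable where date = 20220801"
).count()
println(cnt)
{code}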



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40170) StringCoding UTF8 decode slowly

2022-08-22 Thread caican (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582935#comment-17582935
 ] 

caican commented on SPARK-40170:


My program code is very simple, as shown below.

```

val rdd = spark.sql("select triggerId,adMetadata,userData from 
iceberg_my_cloud.mydb.myTable where date = 20220801").rdd
println(rdd.count())

```

> StringCoding UTF8 decode slowly
> ---
>
> Key: SPARK-40170
> URL: https://issues.apache.org/jira/browse/SPARK-40170
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Priority: Major
> Attachments: image-2022-08-22-10-56-54-768.png, 
> image-2022-08-22-10-57-11-744.png
>
>
> When `UnsafeRow` is converted to `Row` in 
> `org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.createExternalRow`,
> the UTF8String decoding and copyMemory steps are very slow.
> Does anyone have any ideas for optimization?
> !image-2022-08-22-10-56-54-768.png!
>  
> !image-2022-08-22-10-57-11-744.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40179) Run / Scala 2.13 build with SBT GA failed

2022-08-22 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582933#comment-17582933
 ] 

Yang Jie commented on SPARK-40179:
--

ping [~hyukjin.kwon], maybe the file cache is corrupt? Can we clean the local 
repository cache manually?

 

> Run / Scala 2.13 build with SBT GA failed
> -
>
> Key: SPARK-40179
> URL: https://issues.apache.org/jira/browse/SPARK-40179
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
> [error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:36:1:
>   error: package org.apache.http.protocol does not exist
> 1011[error] import org.apache.http.protocol.BasicHttpContext;
> 1012[error]^
> 1013[error] 
> /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:156:1:
>   error: cannot find symbol
> 1014[error] private final HttpContext httpContext;
> 1015[error]   ^  symbol:   class HttpContext
> 1016[error]   location: class HttpKerberosClientAction
> 1017[error] 3 errors {code}
>  
>  * [https://github.com/apache/spark/runs/7947684467?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7947300886?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946453241?check_suite_focus=true]
>  * [https://github.com/apache/spark/runs/7946444061?check_suite_focus=true]
>  
> But a local run of
>  
> {code:java}
> ./dev/change-scala-version.sh 2.13
> ./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver 
> -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests 
> -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile 
> Test/compile
>  {code}
> can pass. Maybe a cache file is corrupt?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40178:


Assignee: Apache Spark

> Rebalance/Repartition Hints Not Working in PySpark
> --
>
> Key: SPARK-40178
> URL: https://issues.apache.org/jira/browse/SPARK-40178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Mac OSX 11.4 Big Sur
> Python 3.9.7
> Spark version >= 3.2.0 (perhaps before as well).
>Reporter: Maxwell Conradt
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Partitioning hints in PySpark do not work because the column parameters are 
> not converted to Catalyst `Expression` instances before being passed to the 
> hint resolver.
> The behavior of the hints is documented 
> [here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].
> Example:
>  
> {code:java}
> >>> df = spark.range(1024)
> >>> 
> >>> df
> DataFrame[id: bigint]
> >>> df.hint("rebalance", "id")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include 
> columns, but id found
> >>> df.hint("repartition", "id")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should 
> include columns, but id found {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582932#comment-17582932
 ] 

Apache Spark commented on SPARK-40178:
--

User 'mhconradt' has created a pull request for this issue:
https://github.com/apache/spark/pull/37616

> Rebalance/Repartition Hints Not Working in PySpark
> --
>
> Key: SPARK-40178
> URL: https://issues.apache.org/jira/browse/SPARK-40178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Mac OSX 11.4 Big Sur
> Python 3.9.7
> Spark version >= 3.2.0 (perhaps before as well).
>Reporter: Maxwell Conradt
>Priority: Major
> Fix For: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Partitioning hints in PySpark do not work because the column parameters are 
> not converted to Catalyst `Expression` instances before being passed to the 
> hint resolver.
> The behavior of the hints is documented 
> [here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].
> Example:
>  
> {code:java}
> >>> df = spark.range(1024)
> >>> 
> >>> df
> DataFrame[id: bigint]
> >>> df.hint("rebalance", "id")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include 
> columns, but id found
> >>> df.hint("repartition", "id")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should 
> include columns, but id found {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40178:


Assignee: (was: Apache Spark)

> Rebalance/Repartition Hints Not Working in PySpark
> --
>
> Key: SPARK-40178
> URL: https://issues.apache.org/jira/browse/SPARK-40178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Mac OSX 11.4 Big Sur
> Python 3.9.7
> Spark version >= 3.2.0 (perhaps before as well).
>Reporter: Maxwell Conradt
>Priority: Major
> Fix For: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Partitioning hints in PySpark do not work because the column parameters are 
> not converted to Catalyst `Expression` instances before being passed to the 
> hint resolver.
> The behavior of the hints is documented 
> [here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].
> Example:
>  
> {code:java}
> >>> df = spark.range(1024)
> >>> 
> >>> df
> DataFrame[id: bigint]
> >>> df.hint("rebalance", "id")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include 
> columns, but id found
> >>> df.hint("repartition", "id")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should 
> include columns, but id found {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40179) Run / Scala 2.13 build with SBT GA failed

2022-08-22 Thread Yang Jie (Jira)
Yang Jie created SPARK-40179:


 Summary: Run / Scala 2.13 build with SBT GA failed
 Key: SPARK-40179
 URL: https://issues.apache.org/jira/browse/SPARK-40179
 Project: Spark
  Issue Type: Bug
  Components: Build, Tests
Affects Versions: 3.4.0
Reporter: Yang Jie


 
{code:java}
[error] 
/home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:36:1:
  error: package org.apache.http.protocol does not exist
1011[error] import org.apache.http.protocol.BasicHttpContext;
1012[error]^
1013[error] 
/home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/auth/HttpAuthUtils.java:156:1:
  error: cannot find symbol
1014[error] private final HttpContext httpContext;
1015[error]   ^  symbol:   class HttpContext
1016[error]   location: class HttpKerberosClientAction
1017[error] 3 errors {code}
 
 * [https://github.com/apache/spark/runs/7947684467?check_suite_focus=true]
 * [https://github.com/apache/spark/runs/7947300886?check_suite_focus=true]
 * [https://github.com/apache/spark/runs/7946453241?check_suite_focus=true]
 * [https://github.com/apache/spark/runs/7946444061?check_suite_focus=true]

 

But a local run of

 
{code:java}
./dev/change-scala-version.sh 2.13
./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver 
-Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests 
-Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile 
Test/compile
 {code}
can pass. Maybe a cache file is corrupt?

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40124) Update TPCDS v1.4 q32 for Plan Stability tests

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582929#comment-17582929
 ] 

Apache Spark commented on SPARK-40124:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/37615

> Update TPCDS v1.4 q32 for Plan Stability tests
> --
>
> Key: SPARK-40124
> URL: https://issues.apache.org/jira/browse/SPARK-40124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

2022-08-22 Thread Maxwell Conradt (Jira)
Maxwell Conradt created SPARK-40178:
---

 Summary: Rebalance/Repartition Hints Not Working in PySpark
 Key: SPARK-40178
 URL: https://issues.apache.org/jira/browse/SPARK-40178
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.2, 3.3.0, 3.2.1, 3.2.0
 Environment: Mac OSX 11.4 Big Sur

Python 3.9.7

Spark version >= 3.2.0 (perhaps before as well).
Reporter: Maxwell Conradt
 Fix For: 3.4.0, 3.3.1, 3.2.2, 3.3.0, 3.2.1, 3.2.0


Partitioning hints in PySpark do not work because the column parameters are not 
converted to Catalyst `Expression` instances before being passed to the hint 
resolver.

The behavior of the hints is documented 
[here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].

Example:

 
{code:java}
>>> df = spark.range(1024)
>>> 
>>> df
DataFrame[id: bigint]
>>> df.hint("rebalance", "id")
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, 
in hint
    jdf = self._jdf.hint(name, self._jseq(parameters))
  File 
"/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
 line 1322, in __call__
  File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in 
deco
    raise converted from None
pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include 
columns, but id found
>>> df.hint("repartition", "id")
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 980, 
in hint
    jdf = self._jdf.hint(name, self._jseq(parameters))
  File 
"/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
 line 1322, in __call__
  File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, in 
deco
    raise converted from None
pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should include 
columns, but id found {code}
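Until the Python-side parameter conversion is fixed, one hedged workaround is the SQL hint syntax from the documentation linked above, which resolves the column name correctly. Sketched here in Scala (the same spark.sql calls work from PySpark), assuming a Spark version whose REBALANCE hint accepts column arguments as the 3.3.0 docs describe:

{code:scala}
// Partitioning hints accept column references when written as SQL hints.
spark.range(1024).createOrReplaceTempView("t")
spark.sql("SELECT /*+ REBALANCE(id) */ * FROM t").explain()
spark.sql("SELECT /*+ REPARTITION(id) */ * FROM t").explain()
{code}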
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38992) Avoid using bash -c in ShellBasedGroupsMappingProvider

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582924#comment-17582924
 ] 

Apache Spark commented on SPARK-38992:
--

User 'leoluan2009' has created a pull request for this issue:
https://github.com/apache/spark/pull/37614

> Avoid using bash -c in ShellBasedGroupsMappingProvider
> --
>
> Key: SPARK-38992
> URL: https://issues.apache.org/jira/browse/SPARK-38992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.3, 3.0.4, 3.3.0, 3.2.2
>
>
> Using bash -c can allow arbitrary shell execution by the end user.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37944) Use error classes in the execution errors of casting

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582884#comment-17582884
 ] 

Apache Spark commented on SPARK-37944:
--

User 'goutam-git' has created a pull request for this issue:
https://github.com/apache/spark/pull/37613

> Use error classes in the execution errors of casting
> 
>
> Key: SPARK-37944
> URL: https://issues.apache.org/jira/browse/SPARK-37944
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * failedToCastValueToDataTypeForPartitionColumnError
> * invalidInputSyntaxForNumericError
> * cannotCastToDateTimeError
> * invalidInputSyntaxForBooleanError
> * nullLiteralsCannotBeCastedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37944) Use error classes in the execution errors of casting

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37944:


Assignee: (was: Apache Spark)

> Use error classes in the execution errors of casting
> 
>
> Key: SPARK-37944
> URL: https://issues.apache.org/jira/browse/SPARK-37944
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * failedToCastValueToDataTypeForPartitionColumnError
> * invalidInputSyntaxForNumericError
> * cannotCastToDateTimeError
> * invalidInputSyntaxForBooleanError
> * nullLiteralsCannotBeCastedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37944) Use error classes in the execution errors of casting

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37944:


Assignee: Apache Spark

> Use error classes in the execution errors of casting
> 
>
> Key: SPARK-37944
> URL: https://issues.apache.org/jira/browse/SPARK-37944
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Migrate the following errors in QueryExecutionErrors:
> * failedToCastValueToDataTypeForPartitionColumnError
> * invalidInputSyntaxForNumericError
> * cannotCastToDateTimeError
> * invalidInputSyntaxForBooleanError
> * nullLiteralsCannotBeCastedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test for every error in QueryExecutionErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39915) Dataset.repartition(N) may not create N partitions

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39915:


Assignee: Apache Spark

> Dataset.repartition(N) may not create N partitions
> --
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For 
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 
> in Spark 3.2.0, but 0 in Spark 3.3.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39915) Dataset.repartition(N) may not create N partitions

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39915:


Assignee: (was: Apache Spark)

> Dataset.repartition(N) may not create N partitions
> --
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For 
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 
> in Spark 3.2.0, but 0 in Spark 3.3.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39915) Dataset.repartition(N) may not create N partitions

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582864#comment-17582864
 ] 

Apache Spark commented on SPARK-39915:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/37612

> Dataset.repartition(N) may not create N partitions
> --
>
> Key: SPARK-39915
> URL: https://issues.apache.org/jira/browse/SPARK-39915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Looks like there is a behavior change in Dataset.repartition in 3.3.0. For 
> example, `spark.range(10, 0).repartition(5).rdd.getNumPartitions` returns 5 
> in Spark 3.2.0, but 0 in Spark 3.3.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-22 Thread lvshaokang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582851#comment-17582851
 ] 

lvshaokang commented on SPARK-38752:


[~maxgekk]  I am working on this.

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38749) Test the error class: RENAME_SRC_PATH_NOT_FOUND

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38749:


Assignee: Apache Spark

> Test the error class: RENAME_SRC_PATH_NOT_FOUND
> ---
>
> Key: SPARK-38749
> URL: https://issues.apache.org/jira/browse/SPARK-38749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *RENAME_SRC_PATH_NOT_FOUND* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def renameSrcPathNotFoundError(srcPath: Path): Throwable = {
> new SparkFileNotFoundException(errorClass = "RENAME_SRC_PATH_NOT_FOUND",
>   Array(srcPath.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38749) Test the error class: RENAME_SRC_PATH_NOT_FOUND

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38749:


Assignee: (was: Apache Spark)

> Test the error class: RENAME_SRC_PATH_NOT_FOUND
> ---
>
> Key: SPARK-38749
> URL: https://issues.apache.org/jira/browse/SPARK-38749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *RENAME_SRC_PATH_NOT_FOUND* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def renameSrcPathNotFoundError(srcPath: Path): Throwable = {
> new SparkFileNotFoundException(errorClass = "RENAME_SRC_PATH_NOT_FOUND",
>   Array(srcPath.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38749) Test the error class: RENAME_SRC_PATH_NOT_FOUND

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38749:


Assignee: Apache Spark

> Test the error class: RENAME_SRC_PATH_NOT_FOUND
> ---
>
> Key: SPARK-38749
> URL: https://issues.apache.org/jira/browse/SPARK-38749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Add a test for the error classes *RENAME_SRC_PATH_NOT_FOUND* to 
> QueryExecutionErrorsSuite. The test should cover the exception throw in 
> QueryExecutionErrors:
> {code:scala}
>   def renameSrcPathNotFoundError(srcPath: Path): Throwable = {
> new SparkFileNotFoundException(errorClass = "RENAME_SRC_PATH_NOT_FOUND",
>   Array(srcPath.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38749) Test the error class: RENAME_SRC_PATH_NOT_FOUND

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582846#comment-17582846
 ] 

Apache Spark commented on SPARK-38749:
--

User 'lvshaokang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37611

> Test the error class: RENAME_SRC_PATH_NOT_FOUND
> ---
>
> Key: SPARK-38749
> URL: https://issues.apache.org/jira/browse/SPARK-38749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *RENAME_SRC_PATH_NOT_FOUND* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def renameSrcPathNotFoundError(srcPath: Path): Throwable = {
> new SparkFileNotFoundException(errorClass = "RENAME_SRC_PATH_NOT_FOUND",
>   Array(srcPath.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must check:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40089) Sorting of at least Decimal(20, 2) fails for some values near the max.

2022-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40089.
-
Fix Version/s: 3.3.1
   3.1.4
   3.2.3
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37540
[https://github.com/apache/spark/pull/37540]

> Sorting of at least Decimal(20, 2) fails for some values near the max.
> --
>
> Key: SPARK-40089
> URL: https://issues.apache.org/jira/browse/SPARK-40089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0, 3.4.0
>Reporter: Robert Joseph Evans
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.1, 3.1.4, 3.2.3, 3.4.0
>
> Attachments: input.parquet
>
>
> I have been doing some testing with Decimal values for the RAPIDS Accelerator 
> for Apache Spark. I have been trying to add in new corner cases and when I 
> tried to enable the maximum supported value for a sort I started to get 
> failures.  On closer inspection it looks like the CPU is sorting things 
> incorrectly.  Specifically anything that is "99.50" or above 
> is placed as a chunk in the wrong location in the outputs.
>  In local mode with 12 tasks.
> {code:java}
> spark.read.parquet("input.parquet").orderBy(col("a")).collect.foreach(System.err.println)
>  {code}
>  
> Here you will notice that the last entry printed is 
> {{[99.49]}}, and {{[99.99]}} is near the top 
> near {{[-99.99]}}
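
For readers without the attachment, a self-contained approximation of the repro (the
constructed values are an assumption chosen to sit at and around the Decimal(20, 2)
maximum and the ".50" boundary described above; whether this standalone construction hits
the same code path as the attached input.parquet is not guaranteed):

{code:scala}
import java.math.BigDecimal

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val spark = SparkSession.builder().master("local[12]").appName("decimal-sort-check").getOrCreate()

val schema = StructType(Seq(StructField("a", DecimalType(20, 2))))
val rows = Seq(
  "-999999999999999999.99",
  "-1.00",
  "999999999999999998.49",
  "999999999999999999.50",   // at the ".50" boundary the report calls out
  "999999999999999999.99"
).map(s => Row(new BigDecimal(s)))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows, 12), schema)
// If the bug reproduces, the largest values do not appear at the end of the sorted output.
df.orderBy(col("a")).collect().foreach(println)
{code}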



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40177) Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b

2022-08-22 Thread Ayushi Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582799#comment-17582799
 ] 

Ayushi Agarwal commented on SPARK-40177:


Working on the PR

> Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
> Fix For: 3.3.1
>
>
> If the join condition has the form key1==key2 || (key1==null && key2==null), the 
> join is executed as a Broadcast Nested Loop Join because this condition does not 
> qualify as an equi-join. BNLJ takes more time than a sort-merge or broadcast join. 
> The condition can be rewritten to key1<=>key2 so that the join executes as a 
> broadcast or sort-merge join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40177) Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b

2022-08-22 Thread Ayushi Agarwal (Jira)
Ayushi Agarwal created SPARK-40177:
--

 Summary: Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b
 Key: SPARK-40177
 URL: https://issues.apache.org/jira/browse/SPARK-40177
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.3.0, 3.2.0
Reporter: Ayushi Agarwal
 Fix For: 3.3.1


If the join condition has the form key1==key2 || (key1==null && key2==null), the join is 
executed as a Broadcast Nested Loop Join because this condition does not qualify as an 
equi-join. BNLJ takes more time than a sort-merge or broadcast join. The condition can be 
rewritten to key1<=>key2 so that the join executes as a broadcast or sort-merge join.
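
A minimal sketch of the proposed rewrite (the DataFrames and column names are illustrative,
not taken from this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("null-safe-join").getOrCreate()
import spark.implicits._

val left = Seq(Some(1), Some(2), None).toDF("key1")
val right = Seq(Some(1), None).toDF("key2")

// As reported: the OR form is not recognised as an equi-join, so it plans as a BroadcastNestedLoopJoin.
val orForm = left.join(right,
  col("key1") === col("key2") || (col("key1").isNull && col("key2").isNull))

// Equivalent null-safe equality: eligible for broadcast hash or sort-merge join.
val nullSafeForm = left.join(right, col("key1") <=> col("key2"))

orForm.explain()
nullSafeForm.explain()
// Both forms keep the same rows: the matching pair (1, 1) and the pair of NULL keys.
{code}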



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40151) Fix return type for new median(interval) function

2022-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40151.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37595
[https://github.com/apache/spark/pull/37595]

> Fix return type for new median(interval) function 
> --
>
> Key: SPARK-40151
> URL: https://issues.apache.org/jira/browse/SPARK-40151
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Max Gekk
>Priority: Critical
> Fix For: 3.4.0
>
>
> median() right now returns an interval of the same type as the input.
> We should instead match mean and avg():
> The result type is computed from the argument type:
> - year-month interval: The result is an `INTERVAL YEAR TO MONTH`.
> - day-time interval: The result is an `INTERVAL DAY TO SECOND`.
> - In all other cases the result is a DOUBLE.
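
A short illustration of the intended typing (the queries below are made up for this note and
assume the new 3.4.0 median function; the comments restate the intended result types from the
list above):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("median-result-types").getOrCreate()

// year-month interval input -> result type should be INTERVAL YEAR TO MONTH
spark.sql("SELECT median(col) FROM VALUES (INTERVAL '1' YEAR), (INTERVAL '2' YEAR) AS t(col)").printSchema()

// day-time interval input -> result type should be INTERVAL DAY TO SECOND
spark.sql("SELECT median(col) FROM VALUES (INTERVAL '1' DAY), (INTERVAL '2' DAY) AS t(col)").printSchema()

// all other inputs -> result type should be DOUBLE, matching avg()/mean()
spark.sql("SELECT median(col) FROM VALUES (1), (2), (3) AS t(col)").printSchema()
{code}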



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40151) Fix return type for new median(interval) function

2022-08-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40151:


Assignee: Max Gekk

> Fix return type for new median(interval) function 
> --
>
> Key: SPARK-40151
> URL: https://issues.apache.org/jira/browse/SPARK-40151
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Max Gekk
>Priority: Critical
>
> median() right now returns an interval of the same type as the input.
> We should instead match mean and avg():
> The result type is computed from the argument type:
> - year-month interval: The result is an `INTERVAL YEAR TO MONTH`.
> - day-time interval: The result is an `INTERVAL DAY TO SECOND`.
> - In all other cases the result is a DOUBLE.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions

2022-08-22 Thread Ayushi Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582797#comment-17582797
 ] 

Ayushi Agarwal commented on SPARK-40176:


Working on this ticket

> Enhance collapse window optimization to work in case partition or order by 
> keys are expressions
> ---
>
> Key: SPARK-40176
> URL: https://issues.apache.org/jira/browse/SPARK-40176
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
> Fix For: 3.3.1
>
>
> In a window operator with multiple window functions, if any expression is present 
> in the partition-by or sort-order columns, the windows are not collapsed even when 
> the partition and order-by expressions are the same for all of those window functions.
> E.g. query:
> val w = Window.partitionBy("key").orderBy(lower(col("value")))
> df.select(lead("key", 1).over(w), lead("value", 1).over(w))
> Current Plan:
> -Window(lead(value,1), key, _w1) -- W1
> - Sort (key, _w1)
> -Project (lower(“value”) as _w1) - P1
> -Window(lead(key,1), key, _w0)  W2
> -Sort(key, _w0)
> -Exchange(key)
> -Project (lower(“value”) as _w0)  P2
> -Scan
>  
> W1 and W2 can be merged into a single window.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3:


Assignee: (was: Apache Spark)

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-3
> URL: https://issues.apache.org/jira/browse/SPARK-3
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB` implementation should be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3:


Assignee: Apache Spark

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-3
> URL: https://issues.apache.org/jira/browse/SPARK-3
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB` implementation should be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38888) Add `RocksDBProvider` similar to `LevelDBProvider`

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582796#comment-17582796
 ] 

Apache Spark commented on SPARK-3:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37610

> Add `RocksDBProvider` similar to `LevelDBProvider`
> --
>
> Key: SPARK-3
> URL: https://issues.apache.org/jira/browse/SPARK-3
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> `LevelDBProvider` is used by `ExternalShuffleBlockResolver` and 
> `YarnShuffleService`; a corresponding `RocksDB` implementation should be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions

2022-08-22 Thread Ayushi Agarwal (Jira)
Ayushi Agarwal created SPARK-40176:
--

 Summary: Enhance collapse window optimization to work in case 
partition or order by keys are expressions
 Key: SPARK-40176
 URL: https://issues.apache.org/jira/browse/SPARK-40176
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.3.0, 3.2.1, 3.2.0
Reporter: Ayushi Agarwal
 Fix For: 3.3.1


In a window operator with multiple window functions, if any expression is present 
in the partition-by or sort-order columns, the windows are not collapsed even when 
the partition and order-by expressions are the same for all of those window functions.

E.g. query:

val w = Window.partitionBy("key").orderBy(lower(col("value")))
df.select(lead("key", 1).over(w), lead("value", 1).over(w))

Current Plan:

-Window(lead(value,1), key, _w1) -- W1
- Sort (key, _w1)
-Project (lower(“value”) as _w1) - P1
-Window(lead(key,1), key, _w0)  W2
-Sort(key, _w0)
-Exchange(key)
-Project (lower(“value”) as _w0)  P2
-Scan

W1 and W2 can be merged into a single window.
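
A self-contained version of the example (the data is made up; the column names follow the
ticket). Running explain() on it currently shows the two Window operators (W1 and W2 above)
that this ticket proposes to collapse:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, lower}

val spark = SparkSession.builder().master("local[*]").appName("collapse-window-expr").getOrCreate()
import spark.implicits._

val df = Seq(("a", "X"), ("a", "Y"), ("b", "Z")).toDF("key", "value")

// Both window functions share the same spec, but the ordering key is an expression, not a bare column.
val w = Window.partitionBy("key").orderBy(lower(col("value")))

df.select(lead("key", 1).over(w), lead("value", 1).over(w)).explain()
{code}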



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582787#comment-17582787
 ] 

Apache Spark commented on SPARK-40175:
--

User 'caican00' has created a pull request for this issue:
https://github.com/apache/spark/pull/37609

> Converting Tuple2 to Scala Map via `.toMap` is slow
> ---
>
> Key: SPARK-40175
> URL: https://issues.apache.org/jira/browse/SPARK-40175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Priority: Major
> Attachments: image-2022-08-22-14-58-26-491.png, 
> image-2022-08-22-14-58-53-046.png
>
>
> Converting Tuple2 to Scala Map via `.toMap` is slow
> !image-2022-08-22-14-58-53-046.png!
> !image-2022-08-22-14-58-26-491.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40175:


Assignee: (was: Apache Spark)

> Converting Tuple2 to Scala Map via `.toMap` is slow
> ---
>
> Key: SPARK-40175
> URL: https://issues.apache.org/jira/browse/SPARK-40175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Priority: Major
> Attachments: image-2022-08-22-14-58-26-491.png, 
> image-2022-08-22-14-58-53-046.png
>
>
> Converting Tuple2 to Scala Map via `.toMap` is slow
> !image-2022-08-22-14-58-53-046.png!
> !image-2022-08-22-14-58-26-491.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow

2022-08-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582781#comment-17582781
 ] 

Apache Spark commented on SPARK-40175:
--

User 'caican00' has created a pull request for this issue:
https://github.com/apache/spark/pull/37608

> Converting Tuple2 to Scala Map via `.toMap` is slow
> ---
>
> Key: SPARK-40175
> URL: https://issues.apache.org/jira/browse/SPARK-40175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Priority: Major
> Attachments: image-2022-08-22-14-58-26-491.png, 
> image-2022-08-22-14-58-53-046.png
>
>
> Converting Tuple2 to Scala Map via `.toMap` is slow
> !image-2022-08-22-14-58-53-046.png!
> !image-2022-08-22-14-58-26-491.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40175) Converting Tuple2 to Scala Map via `.toMap` is slow

2022-08-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40175:


Assignee: Apache Spark

> Converting Tuple2 to Scala Map via `.toMap` is slow
> ---
>
> Key: SPARK-40175
> URL: https://issues.apache.org/jira/browse/SPARK-40175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.0, 3.2.2, 3.3.1
>Reporter: caican
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2022-08-22-14-58-26-491.png, 
> image-2022-08-22-14-58-53-046.png
>
>
> Converting Tuple2 to Scala Map via `.toMap` is slow
> !image-2022-08-22-14-58-53-046.png!
> !image-2022-08-22-14-58-26-491.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40172) Temporarily disable flaky test cases in ImageFileFormatSuite

2022-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40172.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37605
[https://github.com/apache/spark/pull/37605]

> Temporarily disable flaky test cases in ImageFileFormatSuite
> 
>
> Key: SPARK-40172
> URL: https://issues.apache.org/jira/browse/SPARK-40172
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.4.0
>
>
> Three test cases in ImageFileFormatSuite became flaky in the GitHub Actions tests:
> [https://github.com/apache/spark/runs/7941765326?check_suite_focus=true]
> Before they are fixed (https://issues.apache.org/jira/browse/SPARK-40171), I 
> suggest disabling them in OSS.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


