[jira] [Assigned] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40215:


Assignee: Apache Spark

> Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
> 
>
> Key: SPARK-40215
> URL: https://issues.apache.org/jira/browse/SPARK-40215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40215:


Assignee: (was: Apache Spark)

> Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
> 
>
> Key: SPARK-40215
> URL: https://issues.apache.org/jira/browse/SPARK-40215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40039) Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

2022-08-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40039.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37474
[https://github.com/apache/spark/pull/37474]

> Introducing a streaming checkpoint file manager based on Hadoop's Abortable 
> interface
> -
>
> Key: SPARK-40039
> URL: https://issues.apache.org/jira/browse/SPARK-40039
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently on S3 the checkpoint file manager 
> (FileContextBasedCheckpointFileManager) is based on rename: when a file is 
> opened for an atomic stream, a temporary file is used instead, and when the 
> stream is committed the temporary file is renamed to the final name.
> But on S3 a rename is a file copy, so this has serious performance 
> implications.
> Hadoop 3 introduces a new interface called *Abortable*, and *S3AFileSystem* 
> provides this capability on top of S3's multipart upload. When the file is 
> committed a POST is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html]),
>  and when it is aborted a DELETE is sent 
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html]).
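
For illustration, a minimal sketch of the abort-instead-of-rename idea, assuming the stream returned by S3A implements org.apache.hadoop.fs.Abortable (Hadoop 3.3+); the object and helper names are illustrative and not taken from the Spark patch:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Abortable, Path}

// Sketch only: write a file so that it is either committed atomically on
// close() or discarded via abort(). On S3A the stream is backed by a multipart
// upload, so close() completes the upload (the POST above) and abort() cancels
// it (the DELETE above), with no rename and therefore no copy involved.
object AbortableWriteSketch {
  def writeOrAbort(path: Path, bytes: Array[Byte], conf: Configuration): Unit = {
    val fs  = path.getFileSystem(conf)
    val out = fs.create(path, /* overwrite = */ true)
    try {
      out.write(bytes)
      out.close() // commit
    } catch {
      case e: Throwable =>
        out match {
          case a: Abortable => a.abort() // assumption: the S3A stream is Abortable
          case _            => // a rename-based FS would need a temp-file fallback here
        }
        throw e
    }
  }
}
{code}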



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40212) SparkSQL castPartValue does not properly handle byte & short

2022-08-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584613#comment-17584613
 ] 

Yuming Wang commented on SPARK-40212:
-

How to reproduce this issue?

> SparkSQL castPartValue does not properly handle byte & short
> 
>
> Key: SPARK-40212
> URL: https://issues.apache.org/jira/browse/SPARK-40212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Brennan Stein
>Priority: Major
>
> Reading in a parquet file partitioned on disk by a `Byte`-type column fails 
> with the following exception:
>  
> {code:java}
> [info]   Cause: java.lang.ClassCastException: java.lang.Integer cannot be 
> cast to java.lang.Byte
> [info]   at scala.runtime.BoxesRunTime.unboxToByte(BoxesRunTime.java:95)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte(rows.scala:39)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte$(rows.scala:39)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getByte(rows.scala:195)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getByte(JoinedRow.scala:86)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_6$(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$8(ParquetFileFormat.scala:385)
> [info]   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.next(RecordReaderIterator.scala:62)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:189)
> [info]   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> [info]   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
> [info]   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
> [info]   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> [info]   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
> [info]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
> [info]   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info]   at org.apache.spark.scheduler.Task.run(Task.scala:136)
> [info]   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
> [info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
> [info]   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info]   at java.lang.Thread.run(Thread.java:748) {code}
> I believe the issue to stem from 
> [PartitioningUtils::castPartValueToDesiredType|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L533]
>  returning an Integer for ByteType and ShortType (which then fails to unbox 
> to the expected type):
>  
> {code:java}
> case ByteType | ShortType | IntegerType => Integer.parseInt(value) {code}
>  
> The issue appears to have been introduced in [this 
> commit|https://github.com/apache/spark/commit/fc29c91f27d866502f5b6cc4261d4943b57e]
>  so likely affects Spark 3.2 as well, though I've only tested on 3.3.0.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour

2022-08-24 Thread Ivan Sadikov (Jira)
Ivan Sadikov created SPARK-40215:


 Summary: Add SQL configs to control CSV/JSON date and timestamp 
parsing behaviour
 Key: SPARK-40215
 URL: https://issues.apache.org/jira/browse/SPARK-40215
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Ivan Sadikov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40215) Add SQL configs to control CSV/JSON date and timestamp parsing behaviour

2022-08-24 Thread Ivan Sadikov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584612#comment-17584612
 ] 

Ivan Sadikov commented on SPARK-40215:
--

Follow-up.

> Add SQL configs to control CSV/JSON date and timestamp parsing behaviour
> 
>
> Key: SPARK-40215
> URL: https://issues.apache.org/jira/browse/SPARK-40215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters

2022-08-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40213.
--
Fix Version/s: 3.4.0
   3.3.1
   3.2.3
 Assignee: Linhong Liu
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/37651

> Incorrect ASCII value for Latin-1 Supplement characters
> ---
>
> Key: SPARK-40213
> URL: https://issues.apache.org/jira/browse/SPARK-40213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>
> The `ascii()` built-in function in Spark doesn't support Latin-1 Supplement 
> characters, whose values lie in [128, 256). Instead, it produces a wrong 
> value, -62 or -61, for all such characters. But the `chr()` built-in function 
> supports values in [0, 256), and `ascii` should normally be the inverse of 
> `chr()`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40214) Add `get` to dataframe functions

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584589#comment-17584589
 ] 

Apache Spark commented on SPARK-40214:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37652

> Add `get` to dataframe functions
> 
>
> Key: SPARK-40214
> URL: https://issues.apache.org/jira/browse/SPARK-40214
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40214) Add `get` to dataframe functions

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40214:


Assignee: Apache Spark  (was: Ruifeng Zheng)

> Add `get` to dataframe functions
> 
>
> Key: SPARK-40214
> URL: https://issues.apache.org/jira/browse/SPARK-40214
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40214) Add `get` to dataframe functions

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40214:


Assignee: Ruifeng Zheng  (was: Apache Spark)

> Add `get` to dataframe functions
> 
>
> Key: SPARK-40214
> URL: https://issues.apache.org/jira/browse/SPARK-40214
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40214) Add `get` to dataframe functions

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584588#comment-17584588
 ] 

Apache Spark commented on SPARK-40214:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37652

> Add `get` to dataframe functions
> 
>
> Key: SPARK-40214
> URL: https://issues.apache.org/jira/browse/SPARK-40214
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40214) Add `get` to dataframe functions

2022-08-24 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40214:
-

Assignee: Ruifeng Zheng

> Add `get` to dataframe functions
> 
>
> Key: SPARK-40214
> URL: https://issues.apache.org/jira/browse/SPARK-40214
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40214) Add `get` to dataframe functions

2022-08-24 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40214:
-

 Summary: Add `get` to dataframe functions
 Key: SPARK-40214
 URL: https://issues.apache.org/jira/browse/SPARK-40214
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40202) Allow a dictionary in SparkSession.config in PySpark

2022-08-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40202:


Assignee: Hyukjin Kwon

> Allow a dictionary in SparkSession.config in PySpark
> 
>
> Key: SPARK-40202
> URL: https://issues.apache.org/jira/browse/SPARK-40202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-40163 added a new signature in SparkSession.conf. We should have the 
> same one in PySpark too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40202) Allow a dictionary in SparkSession.config in PySpark

2022-08-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40202.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37642
[https://github.com/apache/spark/pull/37642]

> Allow a dictionary in SparkSession.config in PySpark
> 
>
> Key: SPARK-40202
> URL: https://issues.apache.org/jira/browse/SPARK-40202
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> SPARK-40163 added a new signature in SparkSession.conf. We should have the 
> same one in PySpark too.
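
For illustration, a minimal Scala-side sketch of the Map-based usage that SPARK-40163 added and that this ticket mirrors in PySpark; the exact overload signature is an assumption here, not quoted from the API docs:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: pass several settings in one call instead of chaining .config(key, value).
// Assumes the Map-based Builder.config overload introduced by SPARK-40163.
val spark = SparkSession.builder()
  .master("local[1]")
  .config(Map(
    "spark.sql.shuffle.partitions" -> "4",
    "spark.ui.enabled" -> "false"))
  .getOrCreate()
{code}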



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40055) listCatalogs should also return spark_catalog even spark_catalog implementation is defaultSessionCatalog

2022-08-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-40055.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37488
[https://github.com/apache/spark/pull/37488]

> listCatalogs should also return spark_catalog even spark_catalog 
> implementation is defaultSessionCatalog
> 
>
> Key: SPARK-40055
> URL: https://issues.apache.org/jira/browse/SPARK-40055
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40055) listCatalogs should also return spark_catalog even spark_catalog implementation is defaultSessionCatalog

2022-08-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-40055:
---

Assignee: Rui Wang

> listCatalogs should also return spark_catalog even spark_catalog 
> implementation is defaultSessionCatalog
> 
>
> Key: SPARK-40055
> URL: https://issues.apache.org/jira/browse/SPARK-40055
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40213:


Assignee: (was: Apache Spark)

> Incorrect ASCII value for Latin-1 Supplement characters
> ---
>
> Key: SPARK-40213
> URL: https://issues.apache.org/jira/browse/SPARK-40213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Linhong Liu
>Priority: Major
>
> The `ascii()` built-in function in Spark doesn't support Latin-1 Supplement 
> characters, whose values lie in [128, 256). Instead, it produces a wrong 
> value, -62 or -61, for all such characters. But the `chr()` built-in function 
> supports values in [0, 256), and `ascii` should normally be the inverse of 
> `chr()`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40213:


Assignee: Apache Spark

> Incorrect ASCII value for Latin-1 Supplement characters
> ---
>
> Key: SPARK-40213
> URL: https://issues.apache.org/jira/browse/SPARK-40213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> The `ascii()` built-in function in Spark doesn't support Latin-1 Supplement 
> characters, whose values lie in [128, 256). Instead, it produces a wrong 
> value, -62 or -61, for all such characters. But the `chr()` built-in function 
> supports values in [0, 256), and `ascii` should normally be the inverse of 
> `chr()`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584529#comment-17584529
 ] 

Apache Spark commented on SPARK-40213:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/37651

> Incorrect ASCII value for Latin-1 Supplement characters
> ---
>
> Key: SPARK-40213
> URL: https://issues.apache.org/jira/browse/SPARK-40213
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.2
>Reporter: Linhong Liu
>Priority: Major
>
> The `ascii()` built-in function in Spark doesn't support Latin-1 Supplement 
> characters, whose values lie in [128, 256). Instead, it produces a wrong 
> value, -62 or -61, for all such characters. But the `chr()` built-in function 
> supports values in [0, 256), and `ascii` should normally be the inverse of 
> `chr()`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40213) Incorrect ASCII value for Latin-1 Supplement characters

2022-08-24 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-40213:
---

 Summary: Incorrect ASCII value for Latin-1 Supplement characters
 Key: SPARK-40213
 URL: https://issues.apache.org/jira/browse/SPARK-40213
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.2
Reporter: Linhong Liu


The `ascii()` built-in function in Spark doesn't support Latin-1 Supplement 
characters, whose values lie in [128, 256). Instead, it produces a wrong value, 
-62 or -61, for all such characters. But the `chr()` built-in function supports 
values in [0, 256), and `ascii` should normally be the inverse of `chr()`.
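
A small sketch of the expected round-trip, assuming a local SparkSession: the character '§' has Latin-1 code point 167, and its leading UTF-8 byte 0xC2 is -62 as a signed byte, which matches the wrong value described above.

{code:scala}
import org.apache.spark.sql.SparkSession

object AsciiChrCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("ascii-chr").getOrCreate()
    // ascii() should be the inverse of chr() over [0, 256).
    // With the bug, ascii('§') returns -62 instead of 167.
    spark.sql("SELECT ascii('§') AS a, chr(167) AS c").show()
    spark.stop()
  }
}
{code}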



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40212) SparkSQL castPartValue does not properly handle byte & short

2022-08-24 Thread Brennan Stein (Jira)
Brennan Stein created SPARK-40212:
-

 Summary: SparkSQL castPartValue does not properly handle byte & 
short
 Key: SPARK-40212
 URL: https://issues.apache.org/jira/browse/SPARK-40212
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Brennan Stein


Reading in a parquet file partitioned on disk by a `Byte`-type column fails 
with the following exception:

 
{code:java}
[info]   Cause: java.lang.ClassCastException: java.lang.Integer cannot be cast 
to java.lang.Byte
[info]   at scala.runtime.BoxesRunTime.unboxToByte(BoxesRunTime.java:95)
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte(rows.scala:39)
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getByte$(rows.scala:39)
[info]   at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getByte(rows.scala:195)
[info]   at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getByte(JoinedRow.scala:86)
[info]   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_6$(Unknown
 Source)
[info]   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
[info]   at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$8(ParquetFileFormat.scala:385)
[info]   at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator$$anon$1.next(RecordReaderIterator.scala:62)
[info]   at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:189)
[info]   at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[info]   at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
[info]   at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]   at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
[info]   at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
[info]   at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
[info]   at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
[info]   at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[info]   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
[info]   at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
[info]   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info]   at org.apache.spark.scheduler.Task.run(Task.scala:136)
[info]   at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
[info]   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
[info]   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748) {code}
I believe the issue to stem from 
[PartitioningUtils::castPartValueToDesiredType|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L533]
 returning an Integer for ByteType and ShortType (which then fails to unbox to 
the expected type):

 
{code:java}
case ByteType | ShortType | IntegerType => Integer.parseInt(value) {code}
 

The issue appears to have been introduced in [this 
commit|https://github.com/apache/spark/commit/fc29c91f27d866502f5b6cc4261d4943b57e]
 so likely affects Spark 3.2 as well, though I've only tested on 3.3.0.
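
For illustration, a hedged sketch of a per-type parse that would avoid the bad unboxing; this is illustrative only, not the actual patch:

{code:scala}
import org.apache.spark.sql.types._

// Sketch: parse the partition value into the JVM type that matches the declared
// Spark type, so a later getByte/getShort on the internal row can unbox it.
def castPartValueSketch(value: String, dataType: DataType): Any = dataType match {
  case ByteType    => java.lang.Byte.parseByte(value)
  case ShortType   => java.lang.Short.parseShort(value)
  case IntegerType => java.lang.Integer.parseInt(value)
  case _           => value
}
{code}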

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized

2022-08-24 Thread Ziqi Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584458#comment-17584458
 ] 

Ziqi Liu commented on SPARK-40211:
--

I'm actively working on this

> Allow executeTake() / collectLimit's number of starting partitions to be 
> customized
> ---
>
> Key: SPARK-40211
> URL: https://issues.apache.org/jira/browse/SPARK-40211
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> Today, Spark’s executeTake() code allows for the limitScaleUpFactor to be 
> customized but does not allow for the initial number of partitions to be 
> customized: it’s currently hardcoded to {{1}}.
> We should add a configuration so that the initial partition count can be 
> customized. By setting this new configuration to a high value we could 
> effectively mitigate the “run multiple jobs” overhead in {{take}} behavior. 
> We could also set it to higher-than-1-but-still-small values (like, say, 
> {{{}10{}}}) to achieve a middle-ground trade-off.
>  
> Essentially, we need to make {{numPartsToTry = 1L}} 
> ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481])
>  customizable. We should do this via a new SQL conf, similar to the 
> {{limitScaleUpFactor}} conf.
>  
> Spark has several near-duplicate versions of this code ([see code 
> search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
>  * SparkPlan
>  * RDD
>  * pyspark rdd
> Also, {{limitScaleUpFactor}} is not supported in PySpark either. So for now 
> I will focus on the Scala side first, leave the Python side untouched, and 
> sync with the PySpark folks in the meantime. Depending on progress we can do 
> it all in one PR, or make the Scala-side change first and leave the PySpark 
> change as a follow-up.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40211) Allow executeTake() / collectLimit's number of starting partitions to be customized

2022-08-24 Thread Ziqi Liu (Jira)
Ziqi Liu created SPARK-40211:


 Summary: Allow executeTake() / collectLimit's number of starting 
partitions to be customized
 Key: SPARK-40211
 URL: https://issues.apache.org/jira/browse/SPARK-40211
 Project: Spark
  Issue Type: Story
  Components: Spark Core, SQL
Affects Versions: 3.4.0
Reporter: Ziqi Liu


Today, Spark’s executeTake() code allows for the limitScaleUpFactor to be 
customized but does not allow for the initial number of partitions to be 
customized: it’s currently hardcoded to {{1}}.

We should add a configuration so that the initial partition count can be 
customized. By setting this new configuration to a high value we could 
effectively mitigate the “run multiple jobs” overhead in {{take}} behavior. We 
could also set it to higher-than-1-but-still-small values (like, say, 
{{{}10{}}}) to achieve a middle-ground trade-off.

 

Essentially, we need to make {{numPartsToTry = 1L}} 
([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481])
 customizable. We should do this via a new SQL conf, similar to the 
{{limitScaleUpFactor}} conf.

 

Spark has several near-duplicate versions of this code ([see code 
search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
 * SparkPlan
 * RDD
 * pyspark rdd

Also, {{limitScaleUpFactor}} is not supported in PySpark either. So for now I 
will focus on the Scala side first, leave the Python side untouched, and sync 
with the PySpark folks in the meantime. Depending on progress we can do it all 
in one PR, or make the Scala-side change first and leave the PySpark change as 
a follow-up.
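
A greatly simplified sketch of the loop in question, with the starting partition count taken from a parameter instead of being hardcoded to 1; names such as initialNumParts are illustrative, not the final conf name:

{code:scala}
// Simplified sketch of an executeTake-style loop. The real code lives in
// SparkPlan.executeTake; only the hardcoded "numPartsToTry = 1" changes.
def takeSketch[T](totalParts: Int, limit: Int, scaleUpFactor: Int, initialNumParts: Int)
                 (runJob: Range => Seq[T]): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var partsScanned = 0
  var numPartsToTry = math.max(initialNumParts, 1)   // was: 1
  while (buf.size < limit && partsScanned < totalParts) {
    val parts = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    buf ++= runJob(parts).take(limit - buf.size)
    partsScanned += parts.size
    numPartsToTry *= scaleUpFactor                   // existing scale-up behaviour
  }
  buf.toSeq
}
{code}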

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584447#comment-17584447
 ] 

Apache Spark commented on SPARK-40210:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/37650

> Fix math atan2, hypot, pow and pmod float argument call
> ---
>
> Key: SPARK-40210
> URL: https://issues.apache.org/jira/browse/SPARK-40210
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Priority: Minor
>
> The PySpark atan2, hypot, pow and pmod functions are marked as accepting 
> float-type arguments but produce an error when they are actually called with 
> floats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40210:


Assignee: (was: Apache Spark)

> Fix math atan2, hypot, pow and pmod float argument call
> ---
>
> Key: SPARK-40210
> URL: https://issues.apache.org/jira/browse/SPARK-40210
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Priority: Minor
>
> The PySpark atan2, hypot, pow and pmod functions are marked as accepting 
> float-type arguments but produce an error when they are actually called with 
> floats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40210:


Assignee: Apache Spark

> Fix math atan2, hypot, pow and pmod float argument call
> ---
>
> Key: SPARK-40210
> URL: https://issues.apache.org/jira/browse/SPARK-40210
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Khalid Mammadov
>Assignee: Apache Spark
>Priority: Minor
>
> The PySpark atan2, hypot, pow and pmod functions are marked as accepting 
> float-type arguments but produce an error when they are actually called with 
> floats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40210) Fix math atan2, hypot, pow and pmod float argument call

2022-08-24 Thread Khalid Mammadov (Jira)
Khalid Mammadov created SPARK-40210:
---

 Summary: Fix math atan2, hypot, pow and pmod float argument call
 Key: SPARK-40210
 URL: https://issues.apache.org/jira/browse/SPARK-40210
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Khalid Mammadov


The PySpark atan2, hypot, pow and pmod functions are marked as accepting 
float-type arguments but produce an error when they are actually called with 
floats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40209:


Assignee: Max Gekk  (was: Apache Spark)

> Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
> --
>
> Key: SPARK-40209
> URL: https://issues.apache.org/jira/browse/SPARK-40209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql> select cast(interval '10.123' second as decimal(1, 0));
> [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
> If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
> The value 0.10 is not related to 10.123.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40209:


Assignee: Apache Spark  (was: Max Gekk)

> Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
> --
>
> Key: SPARK-40209
> URL: https://issues.apache.org/jira/browse/SPARK-40209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql> select cast(interval '10.123' second as decimal(1, 0));
> [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
> If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
> The value 0.10 is not related to 10.123.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584436#comment-17584436
 ] 

Apache Spark commented on SPARK-40209:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37649

> Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
> --
>
> Key: SPARK-40209
> URL: https://issues.apache.org/jira/browse/SPARK-40209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql> select cast(interval '10.123' second as decimal(1, 0));
> [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
> If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
> The value 0.10 is not related to 10.123.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE

2022-08-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-40209:
-
Description: 
The example below demonstrates the issue:
{code:sql}
spark-sql> select cast(interval '10.123' second as decimal(1, 0));
[NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
{code}

The value 0.10 is not related to 10.123.

  was:
The example below demonstrates the issue:
{code:sql}
spark-sql> select cast(interval '10.123' second as decimal(1, 0));
[NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
{code}



> Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE
> --
>
> Key: SPARK-40209
> URL: https://issues.apache.org/jira/browse/SPARK-40209
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql> select cast(interval '10.123' second as decimal(1, 0));
> [NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
> If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> {code}
> The value 0.10 is not related to 10.123.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40209) Incorrect value in the error message of NUMERIC_VALUE_OUT_OF_RANGE

2022-08-24 Thread Max Gekk (Jira)
Max Gekk created SPARK-40209:


 Summary: Incorrect value in the error message of 
NUMERIC_VALUE_OUT_OF_RANGE
 Key: SPARK-40209
 URL: https://issues.apache.org/jira/browse/SPARK-40209
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


The example below demonstrates the issue:
{code:sql}
spark-sql> select cast(interval '10.123' second as decimal(1, 0));
[NUMERIC_VALUE_OUT_OF_RANGE] 0.10 cannot be represented as Decimal(1, 0). 
If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40195) Add PrunedScanWithAQESuite

2022-08-24 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura resolved SPARK-40195.
---
Resolution: Invalid

I just realized the suite is not for AQE, so closing

> Add PrunedScanWithAQESuite
> --
>
> Key: SPARK-40195
> URL: https://issues.apache.org/jira/browse/SPARK-40195
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `PrunedScanSuite` assumes that AQE is never applied. We should 
> also test with AQE force-applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation

2022-08-24 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-40094.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37528
[https://github.com/apache/spark/pull/37528]

>  Send TaskEnd event when task failed with NotSerializableException or 
> TaskOutputFileAlreadyExistException to release executors for dynamic 
> allocation 
> --
>
> Key: SPARK-40094
> URL: https://issues.apache.org/jira/browse/SPARK-40094
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wangshengjie
>Assignee: wangshengjie
>Priority: Major
> Fix For: 3.4.0
>
>
> We found that if a task fails with NotSerializableException or 
> TaskOutputFileAlreadyExistException, no TaskEnd event is sent, which 
> prevents dynamic allocation from releasing the executor normally.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40094) Send TaskEnd event when task failed with NotSerializableException or TaskOutputFileAlreadyExistException to release executors for dynamic allocation

2022-08-24 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-40094:
---

Assignee: wangshengjie

>  Send TaskEnd event when task failed with NotSerializableException or 
> TaskOutputFileAlreadyExistException to release executors for dynamic 
> allocation 
> --
>
> Key: SPARK-40094
> URL: https://issues.apache.org/jira/browse/SPARK-40094
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: wangshengjie
>Assignee: wangshengjie
>Priority: Major
>
> We found that if a task fails with NotSerializableException or 
> TaskOutputFileAlreadyExistException, no TaskEnd event is sent, which 
> prevents dynamic allocation from releasing the executor normally.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40208) New OFFSET clause does not use new error framework

2022-08-24 Thread Serge Rielau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584351#comment-17584351
 ] 

Serge Rielau commented on SPARK-40208:
--

[~maxgekk] (FYI)
Also (I'm sure LIMIT is the same, maybe fix in one fell swoop?)

spark-sql> SELECT name, age FROM person ORDER BY name OFFSET -1;

Error in query: The offset expression must be equal to or greater than 0, but 
got -1;

Offset -1

+- Sort [name#185 ASC NULLS FIRST], true

   +- Project [name#185, age#186]

      +- SubqueryAlias person

         +- View (`person`, [name#185,age#186])

            +- Project [cast(col1#187 as string) AS name#185, cast(col2#188 as 
int) AS age#186]

               +- LocalRelation [col1#187, col2#188]

> New OFFSET clause does not use new error framework 
> ---
>
> Key: SPARK-40208
> URL: https://issues.apache.org/jira/browse/SPARK-40208
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Minor
>
> CREATE TEMP VIEW person (name, age)
> AS VALUES ('Zen Hui', 25),
> ('Anil B' , 18),
> ('Shone S', 16),
> ('Mike A' , 25),
> ('John A' , 18),
> ('Jack N' , 16);
> SELECT name, age FROM person ORDER BY name OFFSET length(name);
> Error in query: The offset expression must evaluate to a constant value, but 
> got length(person.name);
> Offset length(name#181)
> +- Sort [name#181 ASC NULLS FIRST], true
>    +- Project [name#181, age#182]
>       +- SubqueryAlias person
>          +- View (`person`, [name#181,age#182])
>             +- Project [cast(col1#183 as string) AS name#181, cast(col2#184 
> as int) AS age#182]
>                +- LocalRelation [col1#183, col2#184]
>  
> Returning the plan here is quite pointless as well. The context would be more 
> interesting.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40208) New OFFSET clause does not use new error framework

2022-08-24 Thread Serge Rielau (Jira)
Serge Rielau created SPARK-40208:


 Summary: New OFFSET clause does not use new error framework 
 Key: SPARK-40208
 URL: https://issues.apache.org/jira/browse/SPARK-40208
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Serge Rielau


CREATE TEMP VIEW person (name, age)
AS VALUES ('Zen Hui', 25),
('Anil B' , 18),
('Shone S', 16),
('Mike A' , 25),
('John A' , 18),
('Jack N' , 16);

SELECT name, age FROM person ORDER BY name OFFSET length(name);

Error in query: The offset expression must evaluate to a constant value, but 
got length(person.name);

Offset length(name#181)

+- Sort [name#181 ASC NULLS FIRST], true

   +- Project [name#181, age#182]

      +- SubqueryAlias person

         +- View (`person`, [name#181,age#182])

            +- Project [cast(col1#183 as string) AS name#181, cast(col2#184 as 
int) AS age#182]

               +- LocalRelation [col1#183, col2#184]


 

Returning the plan here is quite pointless as well. The context would be more 
interesting.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40207:

Fix Version/s: (was: 3.4.0)

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Apache Spark
>Priority: Major
>
> Currently, if a data type is not supported by the data source, the 
> exception message does not contain the column name, which makes the 
> problem harder to locate. This Jira aims to improve the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40207:


Assignee: Apache Spark

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if a data type is not supported by the data source, the 
> exception message does not contain the column name, which makes the 
> problem harder to locate. This Jira aims to improve the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40207:


Assignee: Apache Spark

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if a data type is not supported by the data source, the 
> exception message does not contain the column name, which makes the 
> problem harder to locate. This Jira aims to improve the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40207:


Assignee: (was: Apache Spark)

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if a data type is not supported by the data source, the 
> exception message does not contain the column name, which makes the 
> problem harder to locate. This Jira aims to improve the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584268#comment-17584268
 ] 

Apache Spark commented on SPARK-40207:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/37574

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if a data type is not supported by the data source, the 
> exception message does not contain the column name, which makes the 
> problem harder to locate. This Jira aims to improve the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Yi kaifei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi kaifei updated SPARK-40207:
--
Description: Currently, if a data type is not supported by the data source, 
the exception message does not contain the column name, which makes the 
problem harder to locate. This Jira aims to improve the error message.
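
For context, a minimal sketch of the kind of failure being described, using a struct column written to CSV (which CSV does not support); the output path is just a placeholder:

{code:scala}
import org.apache.spark.sql.SparkSession

object UnsupportedTypeRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("repro").getOrCreate()

    // CSV cannot store struct columns, so this write fails with an
    // "unsupported data type" error; the request here is that the message
    // also names the offending column ("s").
    val df = spark.range(1).selectExpr("id", "struct(id) AS s")
    df.write.mode("overwrite").csv("/tmp/unsupported-type-demo")

    spark.stop()
  }
}
{code}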

> Specify the column name when the data type is not supported by datasource
> -
>
> Key: SPARK-40207
> URL: https://issues.apache.org/jira/browse/SPARK-40207
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yi kaifei
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, if a data type is not supported by the data source, the 
> exception message does not contain the column name, which makes the 
> problem harder to locate. This Jira aims to improve the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40207) Specify the column name when the data type is not supported by datasource

2022-08-24 Thread Yi kaifei (Jira)
Yi kaifei created SPARK-40207:
-

 Summary: Specify the column name when the data type is not 
supported by datasource
 Key: SPARK-40207
 URL: https://issues.apache.org/jira/browse/SPARK-40207
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yi kaifei
 Fix For: 3.4.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584247#comment-17584247
 ] 

Apache Spark commented on SPARK-38909:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37648

> Encapsulate LevelDB used by ExternalShuffleBlockResolver and 
> YarnShuffleService as LocalDB
> --
>
> Key: SPARK-38909
> URL: https://issues.apache.org/jira/browse/SPARK-38909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
> directly, which is not conducive to extending the use of {{RocksDB}} in 
> this scenario. This PR adds an encapsulation layer for extensibility. It will be the 
> pre-work of SPARK-3
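
As a rough illustration of the kind of encapsulation described (the trait name and method set below are assumptions, not the actual Spark interface):

{code:scala}
// Hypothetical sketch: a minimal local key-value store abstraction that could be
// backed by either LevelDB or RocksDB. Not the actual Spark API.
trait LocalKVStore extends AutoCloseable {
  def get(key: Array[Byte]): Option[Array[Byte]]
  def put(key: Array[Byte], value: Array[Byte]): Unit
  def delete(key: Array[Byte]): Unit
}
{code}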



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38909) Encapsulate LevelDB used by ExternalShuffleBlockResolver and YarnShuffleService as LocalDB

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584245#comment-17584245
 ] 

Apache Spark commented on SPARK-38909:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37648

> Encapsulate LevelDB used by ExternalShuffleBlockResolver and 
> YarnShuffleService as LocalDB
> --
>
> Key: SPARK-38909
> URL: https://issues.apache.org/jira/browse/SPARK-38909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>
> {{ExternalShuffleBlockResolver}} and {{YarnShuffleService}} use {{LevelDB}} 
> directly, which is not conducive to extending the use of {{RocksDB}} in 
> this scenario. This PR adds an encapsulation layer for extensibility. It will be the 
> pre-work of SPARK-3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39957) Delay onDisconnected to enable Driver receives ExecutorExitCode

2022-08-24 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi reassigned SPARK-39957:


Assignee: Kai-Hsun Chen

> Delay onDisconnected to enable Driver receives ExecutorExitCode
> ---
>
> Key: SPARK-39957
> URL: https://issues.apache.org/jira/browse/SPARK-39957
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Kai-Hsun Chen
>Assignee: Kai-Hsun Chen
>Priority: Major
>
> There are two methods to detect executor loss. First, when an RPC fails, the 
> function {{onDisconnected}} is triggered. Second, when an executor exits 
> with an ExecutorExitCode, the exit code is passed from ExecutorRunner to the 
> Driver. These two methods may reach different conclusions for the same 
> case. We want to categorize the ExecutorLossReason by the 
> ExecutorExitCode. This PR aims to make sure the Driver receives the ExecutorExitCode 
> before onDisconnected is called.
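
A simplified sketch of the ordering idea (illustration only; the object and method names are assumptions, not Spark's implementation):

{code:scala}
// Illustrative sketch: wait a short grace period after a disconnect so that an
// exit code, if one was sent, can arrive and provide a more precise loss reason.
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

object ExecutorLossSketch {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  private val exitCodes = TrieMap.empty[String, Int]

  def onExitCode(executorId: String, code: Int): Unit =
    exitCodes.update(executorId, code)

  def onDisconnected(executorId: String): Unit =
    scheduler.schedule(new Runnable {
      override def run(): Unit = exitCodes.get(executorId) match {
        case Some(code) => println(s"Executor $executorId lost: exit code $code")
        case None       => println(s"Executor $executorId lost: RPC disconnected")
      }
    }, 1, TimeUnit.SECONDS)
}
{code}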



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39957) Delay onDisconnected to enable Driver receives ExecutorExitCode

2022-08-24 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-39957.
--
Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/37400

> Delay onDisconnected to enable Driver receives ExecutorExitCode
> ---
>
> Key: SPARK-39957
> URL: https://issues.apache.org/jira/browse/SPARK-39957
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Kai-Hsun Chen
>Assignee: Kai-Hsun Chen
>Priority: Major
>
> There are two methods to detect executor loss. First, when an RPC fails, the 
> function {{onDisconnected}} is triggered. Second, when an executor exits 
> with an ExecutorExitCode, the exit code is passed from ExecutorRunner to the 
> Driver. These two methods may reach different conclusions for the same 
> case. We want to categorize the ExecutorLossReason by the 
> ExecutorExitCode. This PR aims to make sure the Driver receives the ExecutorExitCode 
> before onDisconnected is called.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39957) Delay onDisconnected to enable Driver receives ExecutorExitCode

2022-08-24 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-39957:
-
Fix Version/s: 3.4.0

> Delay onDisconnected to enable Driver receives ExecutorExitCode
> ---
>
> Key: SPARK-39957
> URL: https://issues.apache.org/jira/browse/SPARK-39957
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Kai-Hsun Chen
>Assignee: Kai-Hsun Chen
>Priority: Major
> Fix For: 3.4.0
>
>
> There are two methods to detect executor loss. First, when an RPC fails, the 
> function {{onDisconnected}} is triggered. Second, when an executor exits 
> with an ExecutorExitCode, the exit code is passed from ExecutorRunner to the 
> Driver. These two methods may reach different conclusions for the same 
> case. We want to categorize the ExecutorLossReason by the 
> ExecutorExitCode. This PR aims to make sure the Driver receives the ExecutorExitCode 
> before onDisconnected is called.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38752:


Assignee: lvshaokang

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: lvshaokang
>Priority: Minor
>  Labels: starter
>
> Add a test for the error class *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class
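
A hedged sketch of such a check (illustration only, not the merged test); it constructs the exception directly, mirroring the snippet quoted above, and assumes that constructor signature:

{code:scala}
// Sketch only: verify the error class carried by the exception that
// dataTypeUnsupportedError builds, per the description above.
import org.apache.spark.SparkIllegalArgumentException

object UnsupportedDatatypeErrorSketch {
  def main(args: Array[String]): Unit = {
    val e = new SparkIllegalArgumentException(
      errorClass = "UNSUPPORTED_DATATYPE",
      messageParameters = Array("\"BADTYPE\" is not supported"))
    assert(e.getErrorClass == "UNSUPPORTED_DATATYPE")
    println(e.getMessage)  // the real test should also check the full message and sqlState
  }
}
{code}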



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38752) Test the error class: UNSUPPORTED_DATATYPE

2022-08-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38752.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37640
[https://github.com/apache/spark/pull/37640]

> Test the error class: UNSUPPORTED_DATATYPE
> --
>
> Key: SPARK-38752
> URL: https://issues.apache.org/jira/browse/SPARK-38752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: lvshaokang
>Priority: Minor
>  Labels: starter
> Fix For: 3.4.0
>
>
> Add a test for the error class *UNSUPPORTED_DATATYPE* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def dataTypeUnsupportedError(dataType: String, failure: String): Throwable 
> = {
> new SparkIllegalArgumentException(errorClass = "UNSUPPORTED_DATATYPE",
>   messageParameters = Array(dataType + failure))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40203.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37644
[https://github.com/apache/spark/pull/37644]

> Add test cases for Spark Decimal
> 
>
> Key: SPARK-40203
> URL: https://issues.apache.org/jira/browse/SPARK-40203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark Decimal has a lot of methods without unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-40203:


Assignee: jiaan.geng

> Add test cases for Spark Decimal
> 
>
> Key: SPARK-40203
> URL: https://issues.apache.org/jira/browse/SPARK-40203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> Spark Decimal has a lot of methods without unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-39791) In Spark 3.0 standalone cluster mode, unable to customize driver JVM path

2022-08-24 Thread Obobj (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570717#comment-17570717
 ] 

Obobj edited comment on SPARK-39791 at 8/24/22 10:49 AM:
-

[~hyukjin.kwon] Thanks for the reply.

In standalone mode, javaHome is always empty, so it always takes 
childEnv.get("JAVA_HOME").


was (Author: JIRAUSER292149):
In standalone mode, javaHome always empty, so always take 
childEnv.get("JAVA_HOME")

> In Spark 3.0 standalone cluster mode, unable to customize driver JVM path
> -
>
> Key: SPARK-39791
> URL: https://issues.apache.org/jira/browse/SPARK-39791
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 3.0.0
>Reporter: Obobj
>Priority: Minor
>  Labels: spark-submit, standalone
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> In Spark 3.0 standalone mode, the driver JVM path cannot be customized; instead, 
> the JAVA_HOME of the machine running spark-submit is used, but the JVM 
> paths of my submission machine and the cluster machines are different.
> {code:java}
> launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
> List<String> buildJavaCommand(String extraClassPath) throws IOException {
>   List<String> cmd = new ArrayList<>();
>   String firstJavaHome = firstNonEmpty(javaHome,
> childEnv.get("JAVA_HOME"),
> System.getenv("JAVA_HOME"),
> System.getProperty("java.home")); {code}
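
A minimal standalone illustration of the lookup order shown above (a sketch, not the Spark launcher code): the first non-empty value wins, so with javaHome unset the submitter's JAVA_HOME is picked even if the cluster machines use a different JVM path.

{code:scala}
// Sketch of the firstNonEmpty resolution order described above (illustration only).
object JavaHomeLookupSketch {
  def firstNonEmpty(candidates: Option[String]*): Option[String] =
    candidates.flatten.find(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val javaHome: Option[String] = None  // not set in standalone cluster mode
    val chosen = firstNonEmpty(
      javaHome,
      sys.env.get("JAVA_HOME"),                  // submitter's environment wins here
      Option(System.getProperty("java.home")))
    println(s"Driver JVM would come from: $chosen")
  }
}
{code}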



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table

2022-08-24 Thread Raymond Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Tang updated SPARK-40206:
-
Labels: hive hive-buckets spark spark-sql  (was: hive hive-buckets spark)

> Spark SQL Predict Pushdown for Hive Bucketed Table
> --
>
> Key: SPARK-40206
> URL: https://issues.apache.org/jira/browse/SPARK-40206
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Raymond Tang
>Priority: Minor
>  Labels: hive, hive-buckets, spark, spark-sql
>
> Hi team,
> I was testing out Hive bucket table features. One of the benefits, as most 
> documentation suggests, is that a bucketed Hive table can be used for query 
> filter/predicate pushdown to improve query performance.
> However, through my exploration, that doesn't seem to be true. *Can you please 
> help clarify whether Spark SQL supports query optimizations when using a Hive 
> bucketed table?*
>  
> How to produce the issue:
> Create a Hive 3 table using the following DDL:
> {code:java}
> create table test_db.bucket_table(user_id int, key string) 
> comment 'A bucketed table' 
> partitioned by(country string) 
> clustered by(user_id) sorted by (key) into 10 buckets
> stored as ORC;{code}
> And then insert into this table using the following PySpark script:
> {code:java}
> from pyspark.sql import SparkSession
> appName = "PySpark Hive Bucketing Example"
> master = "local"
> # Create Spark session with Hive supported.
> spark = SparkSession.builder \
> .appName(appName) \
> .master(master) \
> .enableHiveSupport() \
> .getOrCreate()
> # prepare sample data for inserting into hive table
> data = []
> countries = ['CN', 'AU']
> for i in range(0, 1000):
> data.append([int(i),  'U'+str(i), countries[i % 2]])
> df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
> df.show()
> # Save df to Hive table test_db.bucket_table
> df.write.mode('append').insertInto('test_db.bucket_table') {code}
> Then query the table using the following script:
> {code:java}
> from pyspark.sql import SparkSession
> appName = "PySpark Hive Bucketing Example"
> master = "local"
> # Create Spark session with Hive supported.
> spark = SparkSession.builder \
> .appName(appName) \
> .master(master) \
> .enableHiveSupport() \
> .getOrCreate()
> df = spark.sql("""select * from test_db.bucket_table
> where country='AU' and user_id=101
> """)
> df.show()
> df.explain(extended=True) {code}
> I am expecting to read from only one bucket file in HDFS but instead Spark 
> scanned all bucket files in partition folder country=AU.
> {code:java}
> == Parsed Logical Plan ==
> 'Project [*]
>  - 'Filter (('country = AU) AND ('t1.user_id = 101))
> - 'SubqueryAlias t1
>- 'UnresolvedRelation [test_db, bucket_table], [], false
> == Analyzed Logical Plan ==
> user_id: int, key: string, country: string
> Project [user_id#20, key#21, country#22]
>  - Filter ((country#22 = AU) AND (user_id#20 = 101))
> - SubqueryAlias t1
>- SubqueryAlias spark_catalog.test_db.bucket_table
>   - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc
> == Optimized Logical Plan ==
> Filter (((isnotnull(country#22) AND isnotnull(user_id#20)) AND (country#22 = 
> AU)) AND (user_id#20 = 101))
>  - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc
> == Physical Plan ==
> *(1) Filter (isnotnull(user_id#20) AND (user_id#20 = 101))
>  - *(1) ColumnarToRow
> - FileScan orc test_db.bucket_table[user_id#20,key#21,country#22] 
> Batched: true, DataFilters: [isnotnull(user_id#20), (user_id#20 = 101)], 
> Format: ORC, Location: InMemoryFileIndex(1 
> paths)[hdfs://localhost:9000/user/hive/warehouse/test_db.db/bucket_table/coun...,
>  PartitionFilters: [isnotnull(country#22), (country#22 = AU)], PushedFilters: 
> [IsNotNull(user_id), EqualTo(user_id,101)], ReadSchema: 
> struct   {code}
> *Am I doing something wrong, or is it because Spark doesn't support it? Your 
> guidance and help will be appreciated.* 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table

2022-08-24 Thread Raymond Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Tang updated SPARK-40206:
-
Description: 
Hi team,

I was testing out Hive bucket table features. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please 
help clarify whether Spark SQL supports query optimizations when using a Hive 
bucketed table?*

 

How to produce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I am expecting to read from only one bucket file in HDFS but instead Spark 
scanned all bucket files in partition folder country=AU.
{code:java}
== Parsed Logical Plan ==
'Project [*]
 - 'Filter (('country = AU) AND ('t1.user_id = 101))
- 'SubqueryAlias t1
   - 'UnresolvedRelation [test_db, bucket_table], [], false

== Analyzed Logical Plan ==
user_id: int, key: string, country: string
Project [user_id#20, key#21, country#22]
 - Filter ((country#22 = AU) AND (user_id#20 = 101))
- SubqueryAlias t1
   - SubqueryAlias spark_catalog.test_db.bucket_table
  - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc

== Optimized Logical Plan ==
Filter (((isnotnull(country#22) AND isnotnull(user_id#20)) AND (country#22 = 
AU)) AND (user_id#20 = 101))
 - Relation test_db.bucket_table[user_id#20,key#21,country#22] orc

== Physical Plan ==
*(1) Filter (isnotnull(user_id#20) AND (user_id#20 = 101))
 - *(1) ColumnarToRow
- FileScan orc test_db.bucket_table[user_id#20,key#21,country#22] Batched: 
true, DataFilters: [isnotnull(user_id#20), (user_id#20 = 101)], Format: ORC, 
Location: InMemoryFileIndex(1 
paths)[hdfs://localhost:9000/user/hive/warehouse/test_db.db/bucket_table/coun...,
 PartitionFilters: [isnotnull(country#22), (country#22 = AU)], PushedFilters: 
[IsNotNull(user_id), EqualTo(user_id,101)], ReadSchema: 
struct   {code}
*Am I doing something wrong, or is it because Spark doesn't support it? Your 
guidance and help will be appreciated.* 

 

  was:
Hi team,

I was testing out Hive bucket table features. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please 
help clarify whether Spark SQL supports query optimizations when using a Hive 
bucketed table?*

 

How to produce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:java}

[jira] [Updated] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table

2022-08-24 Thread Raymond Tang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Tang updated SPARK-40206:
-
Description: 
Hi team,

I was testing out Hive bucket table features. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please 
help clarify whether Spark SQL supports query optimizations when using a Hive 
bucketed table?*

 

How to produce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['user_id', 'key', 'country'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I am expecting to read from only one bucket file in HDFS but instead Spark 
scanned all bucket files in partition folder country=AU.

Am I doing something wrong, or is it because Spark doesn't support it? Your 
guidance and help will be appreciated. 

 

  was:
Hi team,

I was testing out Hive bucket table features. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please 
help clarify whether Spark SQL supports query optimizations when using a Hive 
bucketed table?*

 

How to produce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['country', 'user_id', 'key'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I am expecting to read from only one bucket file in HDFS but instead Spark 
scanned all bucket files in partition folder country=AU.

Am I doing something wrong, or is it because Spark doesn't support it? Your 
guidance and help will be appreciated. 

 


> Spark SQL Predict Pushdown for Hive Bucketed Table
> --
>
> Key: SPARK-40206
> URL: https://issues.apache.org/jira/browse/SPARK-40206
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Raymond Tang
>Priority: Minor
>  Labels: hive, hive-buckets, spark
>
> Hi team,
> I was testing out Hive bucket table features.  One of the benefits as most 
> documentation suggested is that 

[jira] [Created] (SPARK-40206) Spark SQL Predict Pushdown for Hive Bucketed Table

2022-08-24 Thread Raymond Tang (Jira)
Raymond Tang created SPARK-40206:


 Summary: Spark SQL Predict Pushdown for Hive Bucketed Table
 Key: SPARK-40206
 URL: https://issues.apache.org/jira/browse/SPARK-40206
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Raymond Tang


Hi team,

I was testing out Hive bucket table features. One of the benefits, as most 
documentation suggests, is that a bucketed Hive table can be used for query 
filter/predicate pushdown to improve query performance.

However, through my exploration, that doesn't seem to be true. *Can you please 
help clarify whether Spark SQL supports query optimizations when using a Hive 
bucketed table?*

 

How to produce the issue:

Create a Hive 3 table using the following DDL:
{code:java}
create table test_db.bucket_table(user_id int, key string) 
comment 'A bucketed table' 
partitioned by(country string) 
clustered by(user_id) sorted by (key) into 10 buckets
stored as ORC;{code}
And then insert into this table using the following PySpark script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

# prepare sample data for inserting into hive table
data = []
countries = ['CN', 'AU']
for i in range(0, 1000):
data.append([int(i),  'U'+str(i), countries[i % 2]])

df = spark.createDataFrame(data, ['country', 'user_id', 'key'])
df.show()

# Save df to Hive table test_db.bucket_table

df.write.mode('append').insertInto('test_db.bucket_table') {code}
Then query the table using the following script:
{code:java}
from pyspark.sql import SparkSession

appName = "PySpark Hive Bucketing Example"
master = "local"

# Create Spark session with Hive supported.
spark = SparkSession.builder \
.appName(appName) \
.master(master) \
.enableHiveSupport() \
.getOrCreate()

df = spark.sql("""select * from test_db.bucket_table
where country='AU' and user_id=101
""")
df.show()
df.explain(extended=True) {code}
I am expecting to read from only one bucket file in HDFS but instead Spark 
scanned all bucket files in partition folder country=AU.

Am I doing something wrong, or is it because Spark doesn't support it? Your 
guidance and help will be appreciated. 
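
For comparison only (a sketch, not a statement about Hive-written bucketed tables): a Spark-native bucketed table created with bucketBy goes through a different code path, and its physical plan can show bucket pruning (e.g. a SelectedBucketsCount entry) for an equality filter on the bucket column.

{code:scala}
// Illustrative sketch: write a Spark-native bucketed table and inspect the plan
// for bucket pruning. This is a different mechanism from Hive-created buckets.
import org.apache.spark.sql.SparkSession

object NativeBucketingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Native bucketing sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = (0 until 1000).map(i => (i, s"U$i")).toDF("user_id", "key")
    df.write.mode("overwrite")
      .bucketBy(10, "user_id")
      .sortBy("key")
      .saveAsTable("bucket_table_native")

    // Look for bucket pruning (e.g. "SelectedBucketsCount") in the scan node.
    spark.sql("SELECT * FROM bucket_table_native WHERE user_id = 101").explain(true)
  }
}
{code}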

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40205:


Assignee: Max Gekk  (was: Apache Spark)

> Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
> ---
>
> Key: SPARK-40205
> URL: https://issues.apache.org/jira/browse/SPARK-40205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Pass a query context to elementAtByIndexZeroError() in ElementAt
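
For context, a minimal reproduction sketch of the error whose query context is being improved (illustration only):

{code:scala}
// Sketch only: element_at with index 0 is expected to fail with the
// ELEMENT_AT_BY_INDEX_ZERO error; this issue adds a query context to that error.
import org.apache.spark.sql.SparkSession

object ElementAtZeroSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("element-at-zero").master("local[*]").getOrCreate()
    spark.sql("SELECT element_at(array(1, 2, 3), 0)").collect()
  }
}
{code}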



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584118#comment-17584118
 ] 

Apache Spark commented on SPARK-40205:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37645

> Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
> ---
>
> Key: SPARK-40205
> URL: https://issues.apache.org/jira/browse/SPARK-40205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Pass a query context to elementAtByIndexZeroError() in ElementAt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40205:


Assignee: Apache Spark  (was: Max Gekk)

> Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
> ---
>
> Key: SPARK-40205
> URL: https://issues.apache.org/jira/browse/SPARK-40205
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Pass a query context to elementAtByIndexZeroError() in ElementAt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40205) Provide a query context of ELEMENT_AT_BY_INDEX_ZERO

2022-08-24 Thread Max Gekk (Jira)
Max Gekk created SPARK-40205:


 Summary: Provide a query context of ELEMENT_AT_BY_INDEX_ZERO
 Key: SPARK-40205
 URL: https://issues.apache.org/jira/browse/SPARK-40205
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Pass a query context to elementAtByIndexZeroError() in ElementAt



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584082#comment-17584082
 ] 

Apache Spark commented on SPARK-40203:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37644

> Add test cases for Spark Decimal
> 
>
> Key: SPARK-40203
> URL: https://issues.apache.org/jira/browse/SPARK-40203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal has a lot of methods without unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40203:


Assignee: Apache Spark

> Add test cases for Spark Decimal
> 
>
> Key: SPARK-40203
> URL: https://issues.apache.org/jira/browse/SPARK-40203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> Spark Decimal has a lot of methods without unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40203:


Assignee: (was: Apache Spark)

> Add test cases for Spark Decimal
> 
>
> Key: SPARK-40203
> URL: https://issues.apache.org/jira/browse/SPARK-40203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal has a lot of methods without unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584085#comment-17584085
 ] 

Apache Spark commented on SPARK-40203:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/37644

> Add test cases for Spark Decimal
> 
>
> Key: SPARK-40203
> URL: https://issues.apache.org/jira/browse/SPARK-40203
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark Decimal has a lot of methods without unit tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40203) Add test cases for Spark Decimal

2022-08-24 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-40203:
--

 Summary: Add test cases for Spark Decimal
 Key: SPARK-40203
 URL: https://issues.apache.org/jira/browse/SPARK-40203
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng


Spark Decimal has a lot of methods without unit tests.
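
A small sketch of the kind of unit check this asks for (illustration only, not the added tests):

{code:scala}
// Sketch only: exercise a couple of Decimal methods and check their results.
import org.apache.spark.sql.types.Decimal

object DecimalSketch {
  def main(args: Array[String]): Unit = {
    val d = Decimal("123.45")
    assert(d.precision == 5 && d.scale == 2)
    assert(d.toJavaBigDecimal == new java.math.BigDecimal("123.45"))
    println(d.toString)
  }
}
{code}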



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40204) Whether it is possible to support querying the status of a specific application in a subsequent version

2022-08-24 Thread bitao (Jira)
bitao created SPARK-40204:
-

 Summary: Whether it is possible to support querying the status of 
a specific application in a subsequent version
 Key: SPARK-40204
 URL: https://issues.apache.org/jira/browse/SPARK-40204
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.6, 2.4.4
 Environment: Standalone Cluster Mode
Reporter: bitao


The current SparkAppHandle cannot obtain the application status in 
Standalone Cluster mode. One option is to query the status of a specific Driver 
through the StandaloneRestServer, but it cannot query the status of a 
specific application. Would it be possible to add a method (e.g. handleAppStatus) to 
the StandaloneRestServer that asks the Master, via the RequestMasterState 
message, for the state of the specified application? The current MasterWebUI 
already does something like this, but it relies on using the same RpcEnv as the 
Master endpoint. In many cases we care about the status of the application rather 
than the status of the Driver, so we hope a subsequent version can support 
obtaining the status of a specific application in 
Standalone cluster mode.
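
For context, a sketch of what the launcher API reports today (illustration only; the app resource, main class, and master URL below are placeholders): the handle state tracks the submission/driver lifecycle, which is the limitation described above.

{code:scala}
// Sketch only: in standalone cluster mode the handle state reflects the
// launcher/driver lifecycle, not the application's own state.
import org.apache.spark.launcher.SparkLauncher

object AppStatusSketch {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")      // placeholder
      .setMainClass("com.example.Main")        // placeholder
      .setMaster("spark://master-host:7077")   // placeholder
      .setDeployMode("cluster")
      .startApplication()

    println(handle.getState)  // e.g. SUBMITTED / RUNNING / FINISHED
  }
}
{code}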



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40180) Format error messages by spark-sql

2022-08-24 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40180.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37590
[https://github.com/apache/spark/pull/37590]

> Format error messages by spark-sql
> --
>
> Key: SPARK-40180
> URL: https://issues.apache.org/jira/browse/SPARK-40180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Respect the SQL config spark.sql.error.messageFormat in the implementation of 
> the SQL CLI: spark-sql.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions

2022-08-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40176:
-
Fix Version/s: (was: 3.3.1)

> Enhance collapse window optimization to work in case partition or order by 
> keys are expressions
> ---
>
> Key: SPARK-40176
> URL: https://issues.apache.org/jira/browse/SPARK-40176
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
>
> In a window operator with multiple window functions, if any expression is 
> present in the partition-by or sort-order columns, the windows are not collapsed even 
> when the partition and order-by expressions are the same for all of those window functions.
> E.g. query:
> val w = Window.partitionBy("key").orderBy(lower(col("value")))
> df.select(lead("key", 1).over(w), lead("value", 1).over(w))
> Current Plan:
> -Window(lead(value,1), key, _w1) -- W1
> - Sort (key, _w1)
> -Project (lower(“value”) as _w1) - P1
> -Window(lead(key,1), key, _w0)  W2
> -Sort(key, _w0)
> -Exchange(key)
> -Project (lower(“value”) as _w0)  P2
> -Scan
>  
> W1 and W2 can be merged into a single window
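
A runnable sketch of the example query from the description (illustration only), which can be used to inspect whether the two Window operators collapse:

{code:scala}
// Sketch only: both window functions share the same partitioning and ordering
// expression, so ideally they would be evaluated by a single Window operator.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, lower}

object CollapseWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("collapse-window").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", "X"), ("a", "Y"), ("b", "Z")).toDF("key", "value")
    val w = Window.partitionBy("key").orderBy(lower(col("value")))

    df.select(lead("key", 1).over(w), lead("value", 1).over(w)).explain()
  }
}
{code}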



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40181) DataFrame.intersect and .intersectAll are inconsistently dropping rows

2022-08-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40181:
-
Component/s: SQL

> DataFrame.intersect and .intersectAll are inconsistently dropping rows
> --
>
> Key: SPARK-40181
> URL: https://issues.apache.org/jira/browse/SPARK-40181
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.1
>Reporter: Luke
>Priority: Major
>
> I don't have a minimal reproducible example for this, but the place where it 
> shows up in our workflow is very simple.
> The data in "COLUMN" are a few hundred million distinct strings (gets 
> deduplicated in the plan also) and it is being compared against itself using 
> intersect.
> The code that is failing is essentially:
> {quote}values = [...] # python list containing many unique strings, none of 
> which are None
> df = spark.createDataFrame(
>     spark.sparkContext.parallelize(
>         [(value,) for value in values], numSlices=2 + len(values) // 1
>     ),
>     schema=StructType([StructField("COLUMN", StringType())]),
> )
> df = df.distinct()
> assert df.count() == df.intersect(df).count()
> assert df.count() == df.intersectAll(df).count()
> {quote}
> The issue is that both of the above asserts sometimes pass, and sometimes 
> fail (technically we haven't seen intersectAll pass yet, but we have only 
> tried a few times). One thing which is striking is that if you call 
> df.intersect(df).count() multiple times, the returned count is not always the 
> same. Sometimes it is exactly df.count(), sometimes it is ~1% lower, but how 
> much lower exactly seems random.
> In particular, we have called df.intersect(df).count() twice in a row, and 
> got two different counts, which is very surprising given that df should be 
> deterministic, and suggests maybe there is some kind of 
> concurrency/inconsistent hashing issue?
> One other thing which is possibly noteworthy is that using df.join(df, 
> df.columns, how="inner") does seem to reliably have the desired behavior (not 
> dropping any rows).
> Here is the resulting plan from df.intersect(df)
> {quote}== Parsed Logical Plan ==
> 'Intersect false
> :- Deduplicate [COLUMN#144487]
> :  +- LogicalRDD [COLUMN#144487], false
> +- Deduplicate [COLUMN#144487]
>    +- LogicalRDD [COLUMN#144487], false
> == Analyzed Logical Plan ==
> COLUMN: string
> Intersect false
> :- Deduplicate [COLUMN#144487]
> :  +- LogicalRDD [COLUMN#144487], false
> +- Deduplicate [COLUMN#144523]
>    +- LogicalRDD [COLUMN#144523], false
> == Optimized Logical Plan ==
> Aggregate [COLUMN#144487], [COLUMN#144487]
> +- Join LeftSemi, (COLUMN#144487 <=> COLUMN#144523)
>    :- LogicalRDD [COLUMN#144487], false
>    +- Aggregate [COLUMN#144523], [COLUMN#144523]
>       +- LogicalRDD [COLUMN#144523], false
> == Physical Plan ==
> *(7) HashAggregate(keys=[COLUMN#144487], functions=[], output=[COLUMN#144487])
> +- Exchange hashpartitioning(COLUMN#144487, 200), true, [id=#22790]
>    +- *(6) HashAggregate(keys=[COLUMN#144487], functions=[], 
> output=[COLUMN#144487])
>       +- *(6) SortMergeJoin [coalesce(COLUMN#144487, ), 
> isnull(COLUMN#144487)], [coalesce(COLUMN#144523, ), isnull(COLUMN#144523)], 
> LeftSemi
>          :- *(2) Sort [coalesce(COLUMN#144487, ) ASC NULLS FIRST, 
> isnull(COLUMN#144487) ASC NULLS FIRST], false, 0
>          :  +- Exchange hashpartitioning(coalesce(COLUMN#144487, ), 
> isnull(COLUMN#144487), 200), true, [id=#22772]
>          :     +- *(1) Scan ExistingRDD[COLUMN#144487]
>          +- *(5) Sort [coalesce(COLUMN#144523, ) ASC NULLS FIRST, 
> isnull(COLUMN#144523) ASC NULLS FIRST], false, 0
>             +- Exchange hashpartitioning(coalesce(COLUMN#144523, ), 
> isnull(COLUMN#144523), 200), true, [id=#22782]
>                +- *(4) HashAggregate(keys=[COLUMN#144523], functions=[], 
> output=[COLUMN#144523])
>                   +- Exchange hashpartitioning(COLUMN#144523, 200), true, 
> [id=#22778]
>                      +- *(3) HashAggregate(keys=[COLUMN#144523], 
> functions=[], output=[COLUMN#144523])
>                         +- *(3) Scan ExistingRDD[COLUMN#144523]
> {quote}
> and for df.intersectAll(df)
> {quote}== Parsed Logical Plan ==
> 'IntersectAll true
> :- Deduplicate [COLUMN#144487]
> :  +- LogicalRDD [COLUMN#144487], false
> +- Deduplicate [COLUMN#144487]
>    +- LogicalRDD [COLUMN#144487], false
> == Analyzed Logical Plan ==
> COLUMN: string
> IntersectAll true
> :- Deduplicate [COLUMN#144487]
> :  +- LogicalRDD [COLUMN#144487], false
> +- Deduplicate [COLUMN#144533]
>    +- LogicalRDD [COLUMN#144533], false
> == Optimized Logical Plan ==
> Project [COLUMN#144487]
> +- Generate replicaterows(min_count#144566L, COLUMN#144487), [1], false, 
> [COLUMN#144487]
>    +- Project [COLUMN#144487, if 

[jira] [Updated] (SPARK-40176) Enhance collapse window optimization to work in case partition or order by keys are expressions

2022-08-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40176:
-
Target Version/s:   (was: 3.3.1)

> Enhance collapse window optimization to work in case partition or order by 
> keys are expressions
> ---
>
> Key: SPARK-40176
> URL: https://issues.apache.org/jira/browse/SPARK-40176
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
> Fix For: 3.3.1
>
>
> In a window operator with multiple window functions, if any expression is 
> present in the partition-by or sort-order columns, the windows are not collapsed even 
> when the partition and order-by expressions are the same for all of those window functions.
> E.g. query:
> val w = Window.partitionBy("key").orderBy(lower(col("value")))
> df.select(lead("key", 1).over(w), lead("value", 1).over(w))
> Current Plan:
> -Window(lead(value,1), key, _w1) -- W1
> - Sort (key, _w1)
> -Project (lower(“value”) as _w1) - P1
> -Window(lead(key,1), key, _w0)  W2
> -Sort(key, _w0)
> -Exchange(key)
> -Project (lower(“value”) as _w0)  P2
> -Scan
>  
> W1 and W2 can be merged into a single window



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39528) Use V2 Filter in SupportsRuntimeFiltering

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584012#comment-17584012
 ] 

Apache Spark commented on SPARK-39528:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37643

> Use V2 Filter in SupportsRuntimeFiltering
> -
>
> Key: SPARK-39528
> URL: https://issues.apache.org/jira/browse/SPARK-39528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, SupportsRuntimeFiltering uses v1 filter. We should use v2 filter 
> instead.
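
A hypothetical sketch of the direction (the trait name and method shapes here are assumptions, not the actual Spark interface): runtime filtering would push down DSv2 Predicate values instead of v1 Filter values.

{code:scala}
// Hypothetical sketch only (not the real interface): accept connector Predicate
// objects for runtime filtering rather than org.apache.spark.sql.sources.Filter.
import org.apache.spark.sql.connector.expressions.NamedReference
import org.apache.spark.sql.connector.expressions.filter.Predicate

trait RuntimeV2FilteringSketch {
  // Attributes this scan can be runtime-filtered on.
  def filterAttributes(): Array[NamedReference]
  // Called with v2 predicates at runtime to prune the scan.
  def filter(predicates: Array[Predicate]): Unit
}
{code}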



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39528) Use V2 Filter in SupportsRuntimeFiltering

2022-08-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584011#comment-17584011
 ] 

Apache Spark commented on SPARK-39528:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37643

> Use V2 Filter in SupportsRuntimeFiltering
> -
>
> Key: SPARK-39528
> URL: https://issues.apache.org/jira/browse/SPARK-39528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.4.0
>
>
> Currently, SupportsRuntimeFiltering uses v1 filter. We should use v2 filter 
> instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org