[jira] [Updated] (SPARK-45878) ConcurrentModificationException in CliSuite

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45878:
---
Labels: pull-request-available  (was: )

> ConcurrentModificationException in CliSuite
> ---
>
> Key: SPARK-45878
> URL: https://issues.apache.org/jira/browse/SPARK-45878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> java.util.ConcurrentModificationException: mutation occurred during iteration
> [info]   at scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
> [info]   at scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
> [info]   at scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247)
> [info]   at scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241)
> [info]   at scala.collection.AbstractIterable.addString(Iterable.scala:933)
> [info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1191)
> [info]   at scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1189)
> [info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
> [info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1204)
> [info]   at scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1204)
> [info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
> [info]   at org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205)
> [info]   at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45878) ConcurrentModificationException in CliSuite

2023-11-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-45878:


 Summary: ConcurrentModificationException in CliSuite
 Key: SPARK-45878
 URL: https://issues.apache.org/jira/browse/SPARK-45878
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: Kent Yao


{code:java}
java.util.ConcurrentModificationException: mutation occurred during iteration
[info]   at scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
[info]   at scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
[info]   at scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247)
[info]   at scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241)
[info]   at scala.collection.AbstractIterable.addString(Iterable.scala:933)
[info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1191)
[info]   at scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1189)
[info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
[info]   at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1204)
[info]   at scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1204)
[info]   at scala.collection.AbstractIterable.mkString(Iterable.scala:933)
[info]   at org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205)
[info]   at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501)
 {code}
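For context, the stack trace points at mutation-checked iteration: in recent Scala 2.13 releases, ArrayBuffer iterators go through a checked view, so appending to the buffer (here, presumably the stdout capture thread) while mkString is iterating hits MutationTracker.checkMutations. A minimal, single-threaded sketch of the same check:

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Scala 2.13 ArrayBuffer iterators are backed by a mutation-checked view.
val buffer = ArrayBuffer("line-1", "line-2")
val it = buffer.iterator
buffer += "line-3"   // mutate after the iterator was created
it.hasNext           // throws java.util.ConcurrentModificationException:
                     //   mutation occurred during iteration
{code}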



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45877) ExecutorFailureTracker support for standalone mode

2023-11-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-45877:


 Summary: ExecutorFailureTracker support for standalone mode
 Key: SPARK-45877
 URL: https://issues.apache.org/jira/browse/SPARK-45877
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Kent Yao


ExecutorFailureTracker now works for K8s and YARN; it would also be an important 
feature for standalone mode to have.
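A hedged sketch of what this could look like from the user side, assuming the core configurations already read by the YARN/K8s trackers would simply start being honored on a standalone master as well:

{code:scala}
import org.apache.spark.SparkConf

// Assumption: these core configs (used by ExecutorFailureTracker on YARN/K8s)
// would also be respected by the standalone scheduler.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .set("spark.executor.maxNumFailures", "10")            // give up after 10 executor failures
  .set("spark.executor.failuresValidityInterval", "1h")  // only count failures from the last hour
{code}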



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45876) Filters are not pushed down across lateral view

2023-11-09 Thread Alexander Petrossian (PAF) (Jira)
Alexander Petrossian (PAF) created SPARK-45876:
--

 Summary: Filters are not pushed down across lateral view
 Key: SPARK-45876
 URL: https://issues.apache.org/jira/browse/SPARK-45876
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Alexander Petrossian (PAF)


{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.sql.catalogImplementation",
                                    "hive").appName("Write ORC File").getOrCreate()
spark.sql('drop TABLE if exists test').show()
# The nested type is reconstructed: only the `characteristic` array and its
# `value` field are certain from the plan below.
spark.sql('CREATE EXTERNAL TABLE test '
          '(request struct<characteristic:array<struct<value:string>>>) '
          'ROW FORMAT SERDE "org.apache.hadoop.hive.ql.io.orc.OrcSerde" '
          'STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" '
          'OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" '
          'LOCATION "testfolder"').show()
spark.sql("select request from test "
          "lateral view explode(request.characteristic) cTable as c "
          "where c.value='7964000'").explain()
{code}

shows
{code}
== Physical Plan ==
*(1) Project [request#2]
+- *(1) Filter (isnotnull(c#4.value) AND (c#4.value = 7964000))
   +- *(1) Generate explode(request#2.characteristic), [request#2], false, [c#4]
      +- *(1) ColumnarToRow
         +- FileScan orc spark_catalog.default.test[request#2] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/Users/paf/Downloads/spark-warehouse/testfolder], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<request:struct<characteristic:array<struct<value:string>>>>
{code}

This is extremely slow: PushedFilters is empty, so the filter is only applied after the explode.

Even when I search for a column value that is totally outside the min/max statistics range, the scan cannot skip any data, although the search could have been much faster.
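For contrast (a hypothetical check, not part of the report, sketched in Scala against an equivalent session): the same equality predicate on a top-level column of an ORC table is expected to appear in PushedFilters at the FileScan, which is the behavior one would hope for on the exploded nested field as well.

{code:scala}
// Hypothetical flat table used only to illustrate the expected pushdown.
spark.sql("CREATE TABLE test_flat (value STRING) STORED AS ORC")
spark.sql("SELECT value FROM test_flat WHERE value = '7964000'").explain()
{code}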



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45875) Remove `MissingStageTableRowData` from `core` module

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45875:
---
Labels: pull-request-available  (was: )

> Remove `MissingStageTableRowData` from `core` module 
> -
>
> Key: SPARK-45875
> URL: https://issues.apache.org/jira/browse/SPARK-45875
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45875) Remove `MissingStageTableRowData` from `core` module

2023-11-09 Thread Yang Jie (Jira)
Yang Jie created SPARK-45875:


 Summary: Remove `MissingStageTableRowData` from `core` module 
 Key: SPARK-45875
 URL: https://issues.apache.org/jira/browse/SPARK-45875
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45874) Remove Java version check from `IsolatedClientLoader`

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45874:
---
Labels: pull-request-available  (was: )

> Remove Java version check from `IsolatedClientLoader`
> -
>
> Key: SPARK-45874
> URL: https://issues.apache.org/jira/browse/SPARK-45874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
> val rootClassLoader: ClassLoader =
>   if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) {
>     // In Java 9, the boot classloader can see few JDK classes. The intended parent
>     // classloader for delegation is now the platform classloader.
>     // See http://java9.wtf/class-loading/
>     val platformCL = classOf[ClassLoader].getMethod("getPlatformClassLoader").
>       invoke(null).asInstanceOf[ClassLoader]
>     // Check to make sure that the root classloader does not know about Hive.
>     assert(Try(platformCL.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure)
>     platformCL
>   } else {
>     // The boot classloader is represented by null (the instance itself isn't accessible)
>     // and before Java 9 can see all JDK classes
>     null
>   }
> {code}
> Spark 4.0.0 has a minimum requirement of Java 17, so the version check for 
> Java 9 is not necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45874) Remove Java version check from `IsolatedClientLoader`

2023-11-09 Thread Yang Jie (Jira)
Yang Jie created SPARK-45874:


 Summary: Remove Java version check from `IsolatedClientLoader`
 Key: SPARK-45874
 URL: https://issues.apache.org/jira/browse/SPARK-45874
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yang Jie


{code:java}
val rootClassLoader: ClassLoader =
  if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) {
    // In Java 9, the boot classloader can see few JDK classes. The intended parent
    // classloader for delegation is now the platform classloader.
    // See http://java9.wtf/class-loading/
    val platformCL = classOf[ClassLoader].getMethod("getPlatformClassLoader").
      invoke(null).asInstanceOf[ClassLoader]
    // Check to make sure that the root classloader does not know about Hive.
    assert(Try(platformCL.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure)
    platformCL
  } else {
    // The boot classloader is represented by null (the instance itself isn't accessible)
    // and before Java 9 can see all JDK classes
    null
  }
{code}
Spark 4.0.0 has a minimum requirement of Java 17, so the version check for Java 
9 is not necessary.
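A minimal sketch of the simplification this implies (an assumption about the eventual change, not the actual PR): with Java 17 guaranteed, both the version check and the reflective lookup can go away, keeping only the platform class loader and the Hive-visibility assertion.

{code:scala}
import scala.util.Try

// getPlatformClassLoader is a plain Java 9+ API, so no reflection is needed
// once the build requires Java 17.
val rootClassLoader: ClassLoader = {
  val platformCL = ClassLoader.getPlatformClassLoader()
  // The root classloader must not see Hive classes.
  assert(Try(platformCL.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure)
  platformCL
}
{code}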



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45847) CliSuite flakiness due to non-sequential guarantee for stdout

2023-11-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-45847:
-
Fix Version/s: 3.4.2

> CliSuite flakiness due to non-sequential guarantee for stdout
> 
>
> Key: SPARK-45847
> URL: https://issues.apache.org/jira/browse/SPARK-45847
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45873) Make ExecutorFailureTracker more tolerant when app remains sufficient resources

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45873:
---
Labels: pull-request-available  (was: )

> Make ExecutorFailureTracker more tolerant when app remains sufficient 
> resources 
> 
>
> Key: SPARK-45873
> URL: https://issues.apache.org/jira/browse/SPARK-45873
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45873) Make ExecutorFailureTracker more tolerant when app remains sufficient resources

2023-11-09 Thread Kent Yao (Jira)
Kent Yao created SPARK-45873:


 Summary: Make ExecutorFailureTracker more tolerant when app 
remains sufficient resources 
 Key: SPARK-45873
 URL: https://issues.apache.org/jira/browse/SPARK-45873
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core, YARN
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45872) Update plugin for SBOM generation to 2.7.10

2023-11-09 Thread Vinod Anandan (Jira)
Vinod Anandan created SPARK-45872:
-

 Summary: Update plugin for SBOM generation to 2.7.10
 Key: SPARK-45872
 URL: https://issues.apache.org/jira/browse/SPARK-45872
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.5.0
Reporter: Vinod Anandan


Update the CycloneDX Maven plugin for SBOM generation to 2.7.10



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45871:
---
Labels: pull-request-available  (was: )

> Change `.toBuffer.toSeq` to `.toSeq`
> 
>
> Key: SPARK-45871
> URL: https://issues.apache.org/jira/browse/SPARK-45871
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`

2023-11-09 Thread Yang Jie (Jira)
Yang Jie created SPARK-45871:


 Summary: Change `.toBuffer.toSeq` to `.toSeq`
 Key: SPARK-45871
 URL: https://issues.apache.org/jira/browse/SPARK-45871
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45814) ArrowConverters.createEmptyArrowBatch may cause memory leak

2023-11-09 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-45814:
-
Fix Version/s: 4.0.0
   3.5.1

> ArrowConverters.createEmptyArrowBatch may cause memory leak
> ---
>
> Key: SPARK-45814
> URL: https://issues.apache.org/jira/browse/SPARK-45814
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: xie shuiahu
>Assignee: xie shuiahu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> ArrowConverters.createEmptyArrowBatch doesn't call hasNext; if TaskContext.get 
> is None, a memory leak happens.
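A rough sketch of the leak pattern being described (names are hypothetical, this is not the actual ArrowConverters code): the allocation is released either by a task-completion listener or when the iterator is drained via hasNext, so outside a task and without a hasNext call neither release path ever runs.

{code:scala}
import org.apache.spark.TaskContext

// `resource` stands in for the Arrow allocator owned by the batch iterator.
class EmptyBatchIterator(resource: AutoCloseable) extends Iterator[Array[Byte]] {
  // Release path 1: only registered when running inside a task.
  Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => resource.close()))

  private var done = false
  // Release path 2: closing happens once the iterator is drained via hasNext.
  override def hasNext: Boolean = { if (done) resource.close(); !done }
  override def next(): Array[Byte] = { done = true; Array.emptyByteArray }
}
// A caller that never invokes hasNext and runs with TaskContext.get() == null
// leaves `resource` unreleased.
{code}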



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45814) ArrowConverters.createEmptyArrowBatch may cause memory leak

2023-11-09 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45814.
--
Fix Version/s: 3.4.2
   Resolution: Fixed

Issue resolved by pull request 43728
[https://github.com/apache/spark/pull/43728]

> ArrowConverters.createEmptyArrowBatch may cause memory leak
> ---
>
> Key: SPARK-45814
> URL: https://issues.apache.org/jira/browse/SPARK-45814
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: xie shuiahu
>Assignee: xie shuiahu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.2
>
>
> ArrowConverters.createEmptyArrowBatch doesn't call hasNext; if TaskContext.get 
> is None, a memory leak happens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43305) Add Java17 dockerfiles for 3.5.0

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43305:
---
Labels: pull-request-available  (was: )

> Add Java17 dockerfiles for 3.5.0
> 
>
> Key: SPARK-43305
> URL: https://issues.apache.org/jira/browse/SPARK-43305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.5.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43305) Add Java17 dockerfiles for 3.5.0

2023-11-09 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784669#comment-17784669
 ] 

Yikun Jiang commented on SPARK-43305:
-

Resolved by https://github.com/apache/spark-docker/pull/56

> Add Java17 dockerfiles for 3.5.0
> 
>
> Key: SPARK-43305
> URL: https://issues.apache.org/jira/browse/SPARK-43305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.5.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43305) Add Java17 dockerfiles for 3.5.0

2023-11-09 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang updated SPARK-43305:

Summary: Add Java17 dockerfiles for 3.5.0  (was: Add Java17 dockerfiles for 
3.4.0)

> Add Java17 dockerfiles for 3.5.0
> 
>
> Key: SPARK-43305
> URL: https://issues.apache.org/jira/browse/SPARK-43305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 3.5.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated

2023-11-09 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45373:
-
Shepherd:   (was: Peter Toth)

> Minimizing calls to HiveMetaStore layer for getting partitions,  when tables 
> are repeated
> -
>
> Key: SPARK-45373
> URL: https://issues.apache.org/jira/browse/SPARK-45373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets 
> converted to an InMemoryFileIndex, the HMS calls can get very expensive if:
> 1) the translated filter string for push-down to the HMS layer becomes empty, 
> resulting in all partitions being fetched, and the same table is referenced 
> multiple times in the query; or
> 2) the same table is referenced multiple times in the query with different 
> partition filters.
> In such cases the current code results in multiple calls to the HMS layer. 
> This can be avoided by grouping the tables by CatalogFileIndex, passing a 
> common minimal filter (filter1 || filter2), and getting a base 
> PrunedInMemoryFileIndex which can become the basis for each of the specific 
> table references.
> Opened the following PR for this ticket:
> [SPARK-45373-PR|https://github.com/apache/spark/pull/43183]
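A rough sketch of the grouping idea, with hypothetical type parameters standing in for CatalogFileIndex, partition filters, and partitions (not the actual rule code): one HMS listing per table using the OR of all filters seen for that table, after which each reference refines the shared result in memory.

{code:scala}
// T = table identity (e.g. the CatalogFileIndex), F = partition filter, P = partition.
def groupedPruning[T, F, P](
    references: Seq[(T, F)],            // every (table, filter) occurrence in the plan
    or: (F, F) => F,                    // filter1 || filter2
    listPartitions: (T, F) => Seq[P],   // the single, expensive HMS call per table
    refine: (Seq[P], F) => Seq[P]): Seq[Seq[P]] = {
  val combinedFilter = references.groupMapReduce(_._1)(_._2)(or)  // one OR-ed filter per table
  val shared = combinedFilter.map { case (t, f) => t -> listPartitions(t, f) }
  references.map { case (t, f) => refine(shared(t), f) }          // cheap in-memory re-filtering
}
{code}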



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2023-11-09 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-
Affects Version/s: 3.5.0
   (was: 2.4.0)
   (was: 3.0.1)
   (was: 3.1.2)

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing new algorithm to create, store and use constraints for removing 
> redundant filters & inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters or 
> creating IsNotNull constraints in various scenarios, and does not push 
> compound predicates in Join.
>  # If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change*
>  # It is more effective in filter pruning as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite 
> where current code is not able to identify the redundant filter in some cases.
>  # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
>  # The current logic can miss many possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or its 
> aliases are repeated multiple times in a complex expression.
>  # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works only by virtue of previous rules helping it out and 
> covering the inaccuracy. That the ConstraintPropagation rule, and its job of 
> removing redundant filters and adding newly inferred ones, depends on how 
> other unrelated earlier optimizer rules behave is indicative of issues.
>  # It does away with all the EqualNullSafe constraints as this logic does not 
> need those constraints to be created.
>  # There is at least one test in the existing ConstraintPropagationSuite which 
> is missing an IsNotNull constraint because the existing Constraints code 
> incorrectly generates an EqualNullSafe constraint instead of an EqualTo 
> constraint. With these changes, the test correctly creates an EqualTo 
> constraint, resulting in an inferred IsNotNull constraint.
>  # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can 
> benefit run-time characteristics too, like inferring IsNotNull filters or 
> pushing down compound predicates on the join, which the present code may 
> respectively miss or not do at all.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints based on the aliases (and even then 
> it may miss many combinations if the expression is complex, with the same 
> attribute repeated multiple times and many aliases to that column). There are 
> query plans in our production environment where the intermediate number of 
> constraints goes into the hundreds of thousands, causing OOM or taking hours. 
> Also, there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of an EqualTo constraint, thus missing a possible IsNotNull 
> constraint on the column. 
> Also, it only pushes single-column predicates to the other side of the join.
> The constraints generated, in some cases, are missing the required ones, and 
> the plan 
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated

2023-11-09 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-45373:
-
Affects Version/s: 3.5.0
   (was: 4.0.0)

> Minimizing calls to HiveMetaStore layer for getting partitions,  when tables 
> are repeated
> -
>
> Key: SPARK-45373
> URL: https://issues.apache.org/jira/browse/SPARK-45373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> In the rule PruneFileSourcePartitions, where the CatalogFileIndex gets 
> converted to an InMemoryFileIndex, the HMS calls can get very expensive if:
> 1) the translated filter string for push-down to the HMS layer becomes empty, 
> resulting in all partitions being fetched, and the same table is referenced 
> multiple times in the query; or
> 2) the same table is referenced multiple times in the query with different 
> partition filters.
> In such cases the current code results in multiple calls to the HMS layer. 
> This can be avoided by grouping the tables by CatalogFileIndex, passing a 
> common minimal filter (filter1 || filter2), and getting a base 
> PrunedInMemoryFileIndex which can become the basis for each of the specific 
> table references.
> Opened the following PR for this ticket:
> [SPARK-45373-PR|https://github.com/apache/spark/pull/43183]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns

2023-11-09 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-44662:
-
Affects Version/s: 3.5.0
   (was: 3.5.1)

> SPIP: Improving performance of BroadcastHashJoin queries with stream side 
> join key on non partition columns
> ---
>
> Key: SPARK-44662
> URL: https://issues.apache.org/jira/browse/SPARK-44662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
> Attachments: perf results broadcast var pushdown - Partitioned 
> TPCDS.pdf
>
>
> h2. *Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.*
> Along the lines of DPP, which helps DataSourceV2 relations when the joining 
> key is a partition column, the same concept can be extended to the case where 
> the joining key is not a partition column.
> The Keys of BroadcastHashJoin are already available before actual evaluation 
> of the stream iterator. These keys can be pushed down to the DataSource as a 
> SortedSet.
> For non-partition columns, DataSources like Iceberg have max/min stats per 
> column available at the manifest level, and formats like Parquet have max/min 
> stats at various storage levels. The passed SortedSet can be used to prune 
> using ranges at both the driver level (manifest files) and the executor level 
> (while actually going through chunks, row groups, etc. at the Parquet level).
> If the data is stored in Columnar Batch format, it would not be possible to 
> filter out individual rows at the DataSource level, even though we have the 
> keys.
> But at the scan level (ColumnToRowExec) it is still possible to filter out as 
> many rows as possible if the query involves nested joins, thus reducing the 
> number of rows to join at the higher join levels.
> Will be adding more details..
> h2. *Q2. What problem is this proposal NOT designed to solve?*
> This can only help in BroadcastHashJoin's performance if the join is Inner or 
> Left Semi.
> This will also not work if there are nodes like Expand, Generator , Aggregate 
> (without group by on keys not part of joining column etc) below the 
> BroadcastHashJoin node being targeted.
> h2. *Q3. How is it done today, and what are the limits of current practice?*
> Currently this sort of pruning at the DataSource level is done using DPP 
> (Dynamic Partition Pruning), and only if one of the join key columns is a 
> partitioning column (so that the cost of the DPP query is justified and far 
> less than the amount of data it filters out by skipping partitions).
> The limitation is that the DPP-type approach is not implemented (intentionally, 
> I believe) when the join column is a non-partition column, because the cost of 
> a "DPP type" query would most likely be far higher than any possible pruning 
> benefit (especially if the column is not stored in a sorted manner).
> h2. *Q4. What is new in your approach and why do you think it will be 
> successful?*
> 1) This allows pruning on non partition column based joins. 
> 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP 
> type" query. 
> 3) The Data can be used by DataSource to prune at driver (possibly) and also 
> at executor level ( as in case of parquet which has max/min at various 
> structure levels)
> 4) The big benefit should be seen in multilevel nested join queries. In the 
> current code base, if I am correct, only one join's pruning filter gets pushed 
> to the scan level. Since it is on the partition key, maybe that is sufficient. 
> But if it is a nested join query, possibly involving different stream-side 
> columns for each join, each such filter push could do significant pruning. 
> This requires some handling in the case of AQE, as the stream-side iterator 
> (and hence stage evaluation) needs to be delayed until all the available join 
> filters in the nested tree are pushed to their respective target 
> BatchScanExec.
> h4. *Single Row Filtration*
> 5) In case of nested broadcast joins, if the datasource is column-vector 
> oriented, then what Spark gets is a ColumnarBatch. But because scans have 
> filters from multiple joins, these can be retrieved and applied in the code 
> generated at the ColumnToRowExec level, using a new "containsKey" method on 
> HashedRelation. Thus only those rows which satisfy all the 
> BroadcastHashJoins (whose keys have been pushed) will be used for join 
> evaluation.
> The code is already there , the PR on spark side is 
> [spark-broadcast-var|https://github.com/apache/spark/pull/43373]. For non 
> partition table TPCDS run on laptop with TPCDS data size of ( scale factor 

[jira] [Resolved] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45850.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43662
[https://github.com/apache/spark/pull/43662]

> Upgrade oracle jdbc driver to 23.3.0.23.09 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Attempt to stabilize `OracleIntegrationSuite` by upgrading the Oracle JDBC 
> driver version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45869) Revisit and Improve Spark Standalone Cluster

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45869:
--
Labels: releasenotes  (was: )

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45756) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45756:
--
Parent: SPARK-45869
Issue Type: Sub-task  (was: Improvement)

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45756) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45756:
--
Labels: pull-request-available  (was: pull-request-available releasenotes)

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45869) Revisit and Improve Spark Standalone Cluster

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45869:
--
Summary: Revisit and Improve Spark Standalone Cluster  (was: Support 
`spark.master.useAppNameAsAppId.enabled`)

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45756) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45756:
--
Summary: Support `spark.master.useAppNameAsAppId.enabled`  (was: Revisit 
and Improve Spark Standalone Cluster)

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45869:
--
Labels:   (was: pull-request-available)

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-45869 ]


Dongjoon Hyun deleted comment on SPARK-45869:
---

was (Author: dongjoon):
This is resolved via [https://github.com/apache/spark/pull/43743]

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45869:
--
Parent: (was: SPARK-45756)
Issue Type: Improvement  (was: Sub-task)

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09

2023-11-09 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45850:

Component/s: Tests

> Upgrade oracle jdbc driver to 23.3.0.23.09 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>
> Attempt to stabilize `OracleIntegrationSuite` by upgrading the Oracle JDBC 
> driver version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09

2023-11-09 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-45850:

Summary: Upgrade oracle jdbc driver to 23.3.0.23.09   (was: Attempt to 
stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version 
)

> Upgrade oracle jdbc driver to 23.3.0.23.09 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45869.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

This is resolved via [https://github.com/apache/spark/pull/43743]

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45869:
---
Labels: pull-request-available  (was: )

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45869:
-

Assignee: Dongjoon Hyun

> Support `spark.master.useAppNameAsAppId.enabled`
> 
>
> Key: SPARK-45869
> URL: https://issues.apache.org/jira/browse/SPARK-45869
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This is an experimental internal configuration for advanced users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45798) Assert server-side session ID in Spark Connect

2023-11-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45798:


Assignee: Martin Grund

> Assert server-side session ID in Spark Connect
> --
>
> Key: SPARK-45798
> URL: https://issues.apache.org/jira/browse/SPARK-45798
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
>
> When accessing the Spark Session remotely, it is possible that the server has 
> silently restarted and we lose temporary state such as views or function 
> definitions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45798) Assert server-side session ID in Spark Connect

2023-11-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45798.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43664
[https://github.com/apache/spark/pull/43664]

> Assert server-side session ID in Spark Connect
> --
>
> Key: SPARK-45798
> URL: https://issues.apache.org/jira/browse/SPARK-45798
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Assignee: Martin Grund
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When accessing the Spark Session remotely, it is possible that the server has 
> silently restarted and we lose temporary state such as views or function 
> definitions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45756) Revisit and Improve Spark Standalone Cluster

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45756:
---
Labels: pull-request-available releasenotes  (was: releasenotes)

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`

2023-11-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45869:
-

 Summary: Support `spark.master.useAppNameAsAppId.enabled`
 Key: SPARK-45869
 URL: https://issues.apache.org/jira/browse/SPARK-45869
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun


This is an experimental internal configuration for advanced users.
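Presumed usage, inferred only from the configuration name (the ticket does not spell out the semantics): when enabled, the standalone Master would register the application under its submitted name instead of the generated app-<timestamp>-<seq> ID.

{code:scala}
import org.apache.spark.SparkConf

// Experimental/internal flag; the effect described above is an assumption
// based on the name, not confirmed by this issue.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("nightly-etl")
  .set("spark.master.useAppNameAsAppId.enabled", "true")
{code}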



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45600) Make Python data source registration session level

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45600:
---
Labels: pull-request-available  (was: )

> Make Python data source registration session level
> --
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Currently, registered data sources are stored in `sharedState` and can be 
> accessed across multiple sessions. This, however, will not work with Spark 
> Connect. We should make this registration session level, and support static 
> registration (e.g. using pip install) in the future.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45600) Make Python data source registration session level

2023-11-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45600:
-
Description: Currently, registered data sources are stored in `sharedState` 
and can be accessed across multiple sessions. This, however, will not work with 
Spark Connect. We should make this registration session level, and support 
static registration (e.g. using pip install) in the future.  (was: Currently we 
have added a few instance variables to store information for Python data source 
reader. We should have a dedicated reader class for Python data source to make 
the current DataFrameReader clean.)

> Make Python data source registration session level
> --
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, registered data sources are stored in `sharedState` and can be 
> accessed across multiple sessions. This, however, will not work with Spark 
> Connect. We should make this registration session level, and support static 
> registration (e.g. using pip install) in the future.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45600) Make data source registration session level

2023-11-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45600:
-
Summary: Make data source registration session level  (was: Separate the 
Python data source logic from DataFrameReader)

> Make data source registration session level
> ---
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently we have added a few instance variables to store information for 
> Python data source reader. We should have a dedicated reader class for Python 
> data source to make the current DataFrameReader clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45600) Make Python data source registration session level

2023-11-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45600:
-
Summary: Make Python data source registration session level  (was: Make 
data source registration session level)

> Make Python data source registration session level
> --
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently we have added a few instance variables to store information for 
> Python data source reader. We should have a dedicated reader class for Python 
> data source to make the current DataFrameReader clean.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43905) Consolidate BlockId parsing and creation

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43905:
---
Labels: pull-request-available  (was: )

> Consolidate BlockId parsing and creation
> 
>
> Key: SPARK-43905
> URL: https://issues.apache.org/jira/browse/SPARK-43905
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Henry Mai
>Priority: Minor
>  Labels: pull-request-available
>
> Consolidate BlockId parsing and creation.
> This helps to cut down on errors arising from parsing the BlockId and also 
> eliminates the need to manually synchronize the code across different places 
> that parse and create BlockIds.
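For reference, the round trip that the consolidation keeps in one place (using the existing public storage API):

{code:scala}
import org.apache.spark.storage.{BlockId, RDDBlockId}

// BlockId.apply parses the string form; `name` reproduces it, so parsing and
// creation stay in sync when both live in one place.
val parsed = BlockId("rdd_2_5")
assert(parsed == RDDBlockId(2, 5))
assert(parsed.name == "rdd_2_5")
{code}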



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44179) When a task failed and the inferred task for that task is still executing, the number of dynamically scheduled executors will be calculated incorrectly

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44179:
---
Labels: pull-request-available  (was: )

> When a task failed and the inferred task for that task is still executing, 
> the number of dynamically scheduled executors will be calculated incorrectly
> ---
>
> Key: SPARK-44179
> URL: https://issues.apache.org/jira/browse/SPARK-44179
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
>
> Assuming a stage has Task 1, with Task 1.0 and a speculative task Task 1.1 
> running concurrently, the dynamic scheduler calculates the number of 
> executors as 2 (pendingTask: 0, pendingSpeculative: 0, running: 2).
> At this point, Task 1.0 fails, and the dynamic scheduler recalculates the 
> number of executors as 2 (pendingTask: 1, pendingSpeculative: 0, running: 1).
> Because Task 1.0 failed, the number of running copies of Task 1 drops to 1. As a 
> result, Task 1 is speculated again and a SparkListenerSpeculativeTaskSubmitted 
> event is triggered. However, the dynamic scheduler now calculates the number of 
> executors as 3 (pendingTask: 1, pendingSpeculative: 1, running: 1), which is not 
> as expected.
> Then, Task 1.2 starts and is marked as a speculative task. However, the dynamic 
> scheduler still calculates the number of executors as 3 (pendingTask: 1, 
> pendingSpeculative: 1, running: 1), which again is not as expected.
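For illustration, a rough Scala sketch of the target-executor arithmetic described above, assuming one task slot per executor; this is a simplification, not the actual ExecutorAllocationManager logic.

{code:scala}
// Simplified model: target executors = pending + pending speculative + running tasks,
// assuming one task slot per executor.
def targetExecutors(pendingTask: Int, pendingSpeculative: Int, running: Int): Int =
  pendingTask + pendingSpeculative + running

targetExecutors(0, 0, 2) // 2: Task 1.0 and speculative Task 1.1 both running
targetExecutors(1, 0, 1) // 2: after Task 1.0 fails
targetExecutors(1, 1, 1) // 3: after Task 1 is speculated again -- the unexpected value
{code}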



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44595) Make the user session cache number and cache time be configurable in spark connect service

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44595:
---
Labels: pull-request-available  (was: )

> Make the user session cache number and cache time be configurable in spark 
> connect service
> --
>
> Key: SPARK-44595
> URL: https://issues.apache.org/jira/browse/SPARK-44595
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Min Zhao
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, the user session cache size is fixed at 100 and the cache timeout at 
> 3600. Make both values configurable to support diverse scenarios.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-11-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-45731.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43629
[https://github.com/apache/spark/pull/43629]

> Update partition statistics with ANALYZE TABLE command
> --
>
> Key: SPARK-45731
> URL: https://issues.apache.org/jira/browse/SPARK-45731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently {{ANALYZE TABLE}} command only updates table-level stats but not 
> partition stats, even though it can be applied to both non-partitioned and 
> partitioned tables. It seems to make sense for it to update partition stats as 
> well.
> Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, 
> but the syntax is more verbose as they need to specify all the partition 
> columns. 
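For illustration, a minimal sketch of the two statement forms discussed above, assuming a {{spark}} session is in scope; the table and column names are made up for this example.

{code:scala}
// A table partitioned by (year, month), for illustration only.
spark.sql("CREATE TABLE sales (amount DOUBLE, year INT, month INT) USING parquet PARTITIONED BY (year, month)")

// Today this refreshes table-level statistics only; the proposal is to have it
// refresh the per-partition statistics as well.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Per-partition statistics are already possible, but every partition column
// has to be spelled out explicitly.
spark.sql("ANALYZE TABLE sales PARTITION (year, month) COMPUTE STATISTICS")
{code}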



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-11-09 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun reassigned SPARK-45731:


Assignee: Chao Sun

> Update partition statistics with ANALYZE TABLE command
> --
>
> Key: SPARK-45731
> URL: https://issues.apache.org/jira/browse/SPARK-45731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> Currently {{ANALYZE TABLE}} command only updates table-level stats but not 
> partition stats, even though it can be applied to both non-partitioned and 
> partitioned tables. It seems to make sense for it to update partition stats as 
> well.
> Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, 
> but the syntax is more verbose as they need to specify all the partition 
> columns. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45867) Support `spark.worker.idPattern`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45867.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43740
[https://github.com/apache/spark/pull/43740]

> Support `spark.worker.idPattern`
> 
>
> Key: SPARK-45867
> URL: https://issues.apache.org/jira/browse/SPARK-45867
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45867) Support `spark.worker.idPattern`

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45867:
-

Assignee: Dongjoon Hyun

> Support `spark.worker.idPattern`
> 
>
> Key: SPARK-45867
> URL: https://issues.apache.org/jira/browse/SPARK-45867
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45868) Make spark.table use the same parser with vanilla spark

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45868:
---
Labels: pull-request-available  (was: )

> Make spark.table use the same parser with vanilla spark
> ---
>
> Key: SPARK-45868
> URL: https://issues.apache.org/jira/browse/SPARK-45868
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45868) Make spark.table use the same parser with vanilla spark

2023-11-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45868:
-

 Summary: Make spark.table use the same parser with vanilla spark
 Key: SPARK-45868
 URL: https://issues.apache.org/jira/browse/SPARK-45868
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45867) Support `spark.worker.idPattern`

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45867:
---
Labels: pull-request-available  (was: )

> Support `spark.worker.idPattern`
> 
>
> Key: SPARK-45867
> URL: https://issues.apache.org/jira/browse/SPARK-45867
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45867) Support `spark.worker.idPattern`

2023-11-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45867:
-

 Summary: Support `spark.worker.idPattern`
 Key: SPARK-45867
 URL: https://issues.apache.org/jira/browse/SPARK-45867
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )

2023-11-09 Thread Asif (Jira)
Asif created SPARK-45866:


 Summary: Reuse of exchange in AQE does not happen when run time 
filters are pushed down to the underlying Scan ( like iceberg )
 Key: SPARK-45866
 URL: https://issues.apache.org/jira/browse/SPARK-45866
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1
Reporter: Asif


In certain types of queries, e.g. TPCDS Query 14b, the reuse of exchange does 
not happen in AQE, resulting in perf degradation.
The Spark TPCDS tests are unable to catch the problem, because the InMemoryScan 
used for testing does not implement equals & hashCode correctly, in the sense 
that it does not take into account the pushed-down runtime filters.

In concrete Scan implementations, e.g. iceberg's SparkBatchQueryScan, the 
equality check, apart from other things, also involves the pushed runtime 
filters (which is correct).

In Spark the issue is this:
For a given stage being materialized, just before materialization starts, the 
runtime filters are confined to the BatchScanExec level.
Only when the actual RDD corresponding to the BatchScanExec is being evaluated 
do the runtime filters get pushed to the underlying Scan.

Now if a new stage is created and it checks the stageCache using its 
canonicalized plan to see whether a stage can be reused, it fails to find the 
reusable stage even if the stage exists, because the canonicalized Spark plan 
present in the stage cache now has the runtime filters pushed to the Scan. So 
the incoming canonicalized Spark plan does not match the key, as their 
underlying scans differ: the incoming Spark plan's scan does not have runtime 
filters, while the canonicalized Spark plan present as the key in the stage 
cache has the scan with runtime filters pushed.

The fix, as I have worked it out, is to provide two methods in the 
SupportsRuntimeV2Filtering interface:

default boolean equalToIgnoreRuntimeFilters(Scan other) {
  return this.equals(other);
}

default int hashCodeIgnoreRuntimeFilters() {
  return this.hashCode();
}

In BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then 
instead of batch.equals it should call scan.equalToIgnoreRuntimeFilters.

And the underlying Scan implementations should provide equality which excludes 
runtime filters.

Similarly, the hashCode of BatchScanExec should use 
scan.hashCodeIgnoreRuntimeFilters instead of batch.hashCode.

Will be creating a PR with bug test for review.
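For illustration, a self-contained Scala sketch of the proposal above, using local stand-in types; the real SupportsRuntimeV2Filtering interface does not have these methods yet, and this is not the actual BatchScanExec code.

{code:scala}
// Stand-in types for illustration only.
trait Scan
trait SupportsRuntimeV2Filtering extends Scan {
  // Proposed additions: equality and hashCode that ignore pushed runtime filters.
  def equalToIgnoreRuntimeFilters(other: Scan): Boolean = this.equals(other)
  def hashCodeIgnoreRuntimeFilters(): Int = this.hashCode()
}

// How a BatchScanExec-like node could consult them when the stage cache
// compares canonicalized plans for reuse.
final case class BatchScanLike(scan: Scan) {
  override def equals(other: Any): Boolean = other match {
    case BatchScanLike(otherScan) =>
      (scan, otherScan) match {
        case (s: SupportsRuntimeV2Filtering, t: SupportsRuntimeV2Filtering) =>
          s.equalToIgnoreRuntimeFilters(t)
        case _ => scan == otherScan
      }
    case _ => false
  }
  override def hashCode(): Int = scan match {
    case s: SupportsRuntimeV2Filtering => s.hashCodeIgnoreRuntimeFilters()
    case _ => scan.hashCode()
  }
}
{code}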



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45865) Add user guide for window operations

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45865:


 Summary: Add user guide for window operations
 Key: SPARK-45865
 URL: https://issues.apache.org/jira/browse/SPARK-45865
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for window operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45864) Add user guide for groupby and aggregate

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45864:


 Summary: Add user guide for groupby and aggregate
 Key: SPARK-45864
 URL: https://issues.apache.org/jira/browse/SPARK-45864
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide to showcase common DataFrame operations involving group 
by and aggregate functions (min, max, count, sum, etc)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45863) Add user guide for column selections

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45863:


 Summary: Add user guide for column selections
 Key: SPARK-45863
 URL: https://issues.apache.org/jira/browse/SPARK-45863
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for column selections in PySpark. This should cover the 
following APIs: lit and df.col, and common column operations such as 
removing a column from a DataFrame, adding new columns, dropping a duplicate 
column, etc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45862) Add user guide for basic dataframe operations

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45862:


 Summary: Add user guide for basic dataframe operations
 Key: SPARK-45862
 URL: https://issues.apache.org/jira/browse/SPARK-45862
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for basic DataFrame operations. This user guide should 
include the following APIs: select, filter, collect, show



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45861) Add user guide for dataframe creation

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45861:


 Summary: Add user guide for dataframe creation
 Key: SPARK-45861
 URL: https://issues.apache.org/jira/browse/SPARK-45861
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for data frame creation.

This user guide should cover the following APIs:
 # spark.createDataFrame
 # spark.read.format(...) (can be csv, json, parquet)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-11-09 Thread Emil Ejbyfeldt (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emil Ejbyfeldt updated SPARK-45592:
---
Description: 
The following query should return 100
{code:java}
import org.apache.spark.storage.StorageLevel

val df = spark.range(0, 100, 1, 5).map(l => (l, l))
val ee = df.select($"_1".as("src"), $"_2".as("dst"))
  .persist(StorageLevel.MEMORY_AND_DISK)

ee.count()
val minNbrs1 = ee
  .groupBy("src").agg(min(col("dst")).as("min_number"))
  .persist(StorageLevel.MEMORY_AND_DISK)
val join = ee.join(minNbrs1, "src")
join.count(){code}
but on spark 3.5.0 there is a correctness bug causing it to return `104800` or 
some other smaller value.

  was:
The following query should return 100
{code:java}
import org.apache.spark.storage.StorageLevelval

df = spark.range(0, 100, 1, 5).map(l => (l, l))
val ee = df.select($"_1".as("src"), $"_2".as("dst"))
  .persist(StorageLevel.MEMORY_AND_DISK)

ee.count()
val minNbrs1 = ee
  .groupBy("src").agg(min(col("dst")).as("min_number"))
  .persist(StorageLevel.MEMORY_AND_DISK)
val join = ee.join(minNbrs1, "src")
join.count(){code}
but on spark 3.5.0 there is a correctness bug causing it to return `104800` or 
some other smaller value.


> AQE and InMemoryTableScanExec correctness bug
> -
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> The following query should return 100
> {code:java}
> import org.apache.spark.storage.StorageLevel
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on spark 3.5.0 there is a correctness bug causing it to return `104800` 
> or some other smaller value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45860) ClassCastException with SerializedLambda in Spark Cluster Mode

2023-11-09 Thread Abhilash (Jira)
Abhilash created SPARK-45860:


 Summary: ClassCastException with SerializedLambda in Spark Cluster 
Mode
 Key: SPARK-45860
 URL: https://issues.apache.org/jira/browse/SPARK-45860
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Submit
Affects Versions: 3.4.1, 3.2.1
 Environment: *Environment*
Java Version: 11
Spring Boot Version: 2.7.10
Spark Version: 3.2.1
Reporter: Abhilash


h3. Issue Description

Running a Spark application in cluster mode encounters a 
`java.lang.ClassCastException` related to 
`java.lang.invoke.SerializedLambda`. This issue seems to be specific to 
Spark cluster mode, and it doesn't occur when running the application 
locally without Spring Boot.

 
h3. Steps to Reproduce
 # Create a dummy dataset
{code:java}
Dataset<String> dummyData = spark.createDataset(Arrays.asList("Abhi", "Andrii", 
"Rick", "Duc"), Encoders.STRING()); {code}

 # Call flatMap function to transform the data
{code:java}
Dataset<TestData> transformedData = dummyData.flatMap(new TestDataFlatMap(), 
Encoders.bean(TestData.class)); {code}

 # Call any action on the transformed dataset
{code:java}
transformedData.show(); {code}

 # Running this Spark application with spark submit command in cluster mode 
with Spring Boot results in the mentioned ClassCastException.

 
h3. *Complete Code:*

 
{code:java}
@SpringBootApplication(exclude = 
{org.springframework.boot.autoconfigure.gson.GsonAutoConfiguration.class})
public class SampleSparkJob{
    public static void main(String[] args) {
        SpringApplication.run(DataIngestionServiceApplication.class, args);

        SparkSession spark = SparkSession.builder()
                .appName("SampleSparkJob")
                .master("local[*]")
                .getOrCreate();
        Dataset<String> dummyData = spark.createDataset(Arrays.asList("Abhi", 
"Andrii", "Rick", "Duc"), Encoders.STRING());
        Dataset<TestData> transformedData = dummyData.flatMap(new 
TestDataFlatMap(), Encoders.bean(TestData.class));
        transformedData.show();
        transformedData.write().mode("append").parquet("outputpath");
        spark.stop();
    }
}{code}
{code:java}
class TestDataFlatMap implements FlatMapFunction<String, TestData>, 
Serializable {
    @Override
    public Iterator<TestData> call(String name) {
        return Arrays.asList(new TestData(name)).iterator();
    }
}{code}
{code:java}
@Data
@AllArgsConstructor
public class TestData implements Serializable {
    private String name;
} {code}
 
h3. Stack trace:
{code:java}
WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.248.66.38 executor 
0): java.lang.ClassCastException: cannot assign instance of 
java.lang.invoke.SerializedLambda to field 
org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of 
org.apache.spark.rdd.MapPartitionsRDD  at 
java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2076)
at 
java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(ObjectStreamClass.java:2039)
   at 
java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(ObjectStreamClass.java:1293)
 at 
java.base/java.io.ObjectInputStream.defaultCheckFieldValues(ObjectInputStream.java:2512)
 at 
java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2419) 
 at 
java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
  at 
java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687) at 
java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496)
   at 
java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390) 
 at 
java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
  at 
java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687) at 
java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489)   at 
java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447)   at 
scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527)   
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.base/java.lang.reflect.Method.invoke(Method.java:566)   at 
java.base/java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1046)
at 
java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2357) 
 at 
java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228)
  at 
java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687) at 

[jira] [Updated] (SPARK-45859) Make UDF objects in ml.functions lazy

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45859:
---
Labels: pull-request-available  (was: )

> Make UDF objects in ml.functions lazy
> -
>
> Key: SPARK-45859
> URL: https://issues.apache.org/jira/browse/SPARK-45859
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0, 4.0.0, 3.0, 3.1
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45859) Make UDF objects in ml.functions lazy

2023-11-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45859:
-

 Summary: Make UDF objects in ml.functions lazy
 Key: SPARK-45859
 URL: https://issues.apache.org/jira/browse/SPARK-45859
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0, 3.1, 3.5.0, 3.4.0, 3.3.0, 3.2.0, 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-11-09 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784567#comment-17784567
 ] 

Asif commented on SPARK-45658:
--

I also think that during canonicalization of DynamicPruningSubquery, the 
pruning key's canonicalization should be done on the basis of the enclosing 
Plan which contains the DynamicPruningSubquery Expression.

> Canonicalization of DynamicPruningSubquery is broken
> 
>
> Key: SPARK-45658
> URL: https://issues.apache.org/jira/browse/SPARK-45658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> The canonicalization of (buildKeys: Seq[Expression]) in the class 
> DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
> calling 
> buildKeys.map(_.canonicalized)
> The above would result in incorrect canonicalization as it would not be 
> normalizing the exprIds relative to the buildQuery output.
> The fix is to use the buildQuery: LogicalPlan's output to normalize the 
> buildKeys expressions, 
> as given below, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)),
> Will be filing a PR and bug test for the same.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44609) ExecutorPodsAllocator doesn't create new executors if no pod snapshot captured pod creation

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44609:
---
Labels: pull-request-available  (was: )

> ExecutorPodsAllocator doesn't create new executors if no pod snapshot 
> captured pod creation
> ---
>
> Key: SPARK-44609
> URL: https://issues.apache.org/jira/browse/SPARK-44609
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Scheduler
>Affects Versions: 3.4.1
>Reporter: Alibi Yeslambek
>Priority: Major
>  Labels: pull-request-available
>
> There’s a following race condition in ExecutorPodsAllocator when running a 
> spark application with static allocation on kubernetes with numExecutors >= 1:
>  * Driver requests an executor
>  * exec-1 gets created and registers with driver
>  * exec-1 is moved from {{newlyCreatedExecutors}} to 
> {{schedulerKnownNewlyCreatedExecs}}
>  * exec-1 got deleted very quickly (~1-30 sec) after registration
>  * {{ExecutorPodsWatchSnapshotSource}} fails to catch the creation of the pod 
> (e.g. websocket connection was reset, k8s-apiserver was down, etc.)
>  * {{ExecutorPodsPollingSnapshotSource}} fails to catch the creation because 
> it runs every 30 secs, but executor was removed much quicker after creation
>  * exec-1 is never removed from {{schedulerKnownNewlyCreatedExecs}}
>  * {{ExecutorPodsAllocator}} will never request a new executor because its 
> slot is occupied by exec-1, due to {{schedulerKnownNewlyCreatedExecs}} never 
> being cleared.
>  
> Put up a fix here https://github.com/apache/spark/pull/42297



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45592:
-

Assignee: Emil Ejbyfeldt  (was: Apache Spark)

> AQE and InMemoryTableScanExec correctness bug
> -
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Emil Ejbyfeldt
>Assignee: Emil Ejbyfeldt
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> The following query should return 100
> {code:java}
> import org.apache.spark.storage.StorageLevel
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on spark 3.5.0 there is a correctness bug causing it to return `104800` 
> or some other smaller value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45592:
--
Target Version/s: 3.4.2

> AQE and InMemoryTableScanExec correctness bug
> -
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Emil Ejbyfeldt
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> The following query should return 100
> {code:java}
> import org.apache.spark.storage.StorageLevel
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on spark 3.5.0 there is a correctness bug causing it to return `104800` 
> or some other smaller value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug

2023-11-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45592:
--
Affects Version/s: 3.4.1

> AQE and InMemoryTableScanExec correctness bug
> -
>
> Key: SPARK-45592
> URL: https://issues.apache.org/jira/browse/SPARK-45592
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Emil Ejbyfeldt
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> The following query should return 100
> {code:java}
> import org.apache.spark.storage.StorageLevel
> val df = spark.range(0, 100, 1, 5).map(l => (l, l))
> val ee = df.select($"_1".as("src"), $"_2".as("dst"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ee.count()
> val minNbrs1 = ee
>   .groupBy("src").agg(min(col("dst")).as("min_number"))
>   .persist(StorageLevel.MEMORY_AND_DISK)
> val join = ee.join(minNbrs1, "src")
> join.count(){code}
> but on spark 3.5.0 there is a correctness bug causing it to return `104800` 
> or some other smaller value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45858) Consistent FetchFailed/NoSuchFileExceptions when decommissioning is enabled

2023-11-09 Thread Alibi Yeslambek (Jira)
Alibi Yeslambek created SPARK-45858:
---

 Summary: Consistent FetchFailed/NoSuchFileExceptions when 
decommissioning is enabled
 Key: SPARK-45858
 URL: https://issues.apache.org/jira/browse/SPARK-45858
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Alibi Yeslambek


Decommissioning causes FetchFailures with NoSuchFileException due to multiple 
tasks on the same partition from different stage attempts sharing a single 
MapStatus object. Is there any workaround/config flag that I’m missing that 
will fix the issue or is this rather a bug?

*Example*
Here are same tasks from different stage attempts for the same partition:
{code:java}
INFO [2023-11-07T17:50:03.399091Z] org.apache.spark.scheduler.TaskSetManager: 
Starting task 16.0 in stage 11.1 (TID 1810) (10.0.158.211, executor 5, 
partition 81, PROCESS_LOCAL, 4743 bytes) taskResourceAssignments Map()
INFO [2023-11-07T17:51:20.229168Z] org.apache.spark.scheduler.TaskSetManager: 
Starting task 13.0 in stage 11.2 (TID 1836) (10.0.187.67, executor 6, partition 
81, PROCESS_LOCAL, 4743 bytes) taskResourceAssignments Map() {code}
The latest mapStatus.location for partition 81 will point to the executor of the 
latest succeeded task (exec-6), i.e.:
{code:java}
mapStatus(81).location = BlockManagerId(6, 10.0.187.67, 7079, None){code}
Which means that multiple MapIDs point to the same MapIndex and share one 
MapStatus object. In this example:

 
{code:java}
mapIdToMapIndex(1810) = 81
mapIdToMapIndex(1836) = 81 
{code}
Now if we decommission exec-5, all of its blocks (including 1810) will be 
migrated and the driver mapStatuses will be updated.
{code:java}
INFO [2023-11-07T17:57:23.545274Z] org.apache.spark.ShuffleStatus: Updating map 
output for 1810 to BlockManagerId(4, 10.0.153.179, 7079, None){code}
Which updates mapStatus.location for partition 81 to exec-4:
{code:java}
mapStatus(81).location = BlockManagerId(4, 10.0.153.179, 7079, None){code}
And when a task from a different stage tries to fetch the block for 
{{MapId: 1836}}, the driver will return its location as exec-4, whereas in fact 
it is still on exec-6. The task will fail with a FetchFailure caused by 
NoSuchFileException, because the actual block is on exec-6.
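For illustration, a toy Scala model of the bookkeeping described above, with plain maps standing in for the driver's shuffle-status structures; this is not the actual MapOutputTracker code.

{code:scala}
import scala.collection.mutable

// Two map task attempts (map IDs 1810 and 1836) completed the same partition index 81,
// but the driver keeps a single location slot per partition index.
val mapIdToMapIndex   = mutable.Map(1810L -> 81, 1836L -> 81)
val mapStatusLocation = mutable.Map(81 -> "exec-6") // location of the latest successful attempt

// Decommissioning exec-5 migrates the blocks written by map ID 1810 to exec-4 ...
mapStatusLocation(mapIdToMapIndex(1810L)) = "exec-4"

// ... but a reducer asking for map ID 1836 now also gets exec-4, even though
// 1836's shuffle file still lives on exec-6 -> FetchFailed / NoSuchFileException.
println(mapStatusLocation(mapIdToMapIndex(1836L))) // prints "exec-4"
{code}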
{code:java}
WARN [2023-11-07T17:58:40.008602Z] org.apache.spark.scheduler.TaskSetManager: 
Lost task 14.0 in stage 16.0 (TID 1988) (10.0.156.83 executor 24): 
FetchFailed(BlockManagerId(4, 10.0.153.179, 7079, None), shuffleId=5, 
mapIndex=81, mapId=1836, reduceId=84, message=
org.apache.spark.shuffle.FetchFailedException
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1167)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:903)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.sort_addToSorter_0$(generated.java:31)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(generated.java:43)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:776)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage12.smj_findNextJoinRows_0$(generated.java:40)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage12.processNext(generated.java:101)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:795)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:516)
at 

[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45658:
---
Labels: pull-request-available  (was: )

> Canonicalization of DynamicPruningSubquery is broken
> 
>
> Key: SPARK-45658
> URL: https://issues.apache.org/jira/browse/SPARK-45658
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> The canonicalization of (buildKeys: Seq[Expression]) in the class 
> DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by 
> calling 
> buildKeys.map(_.canonicalized)
> The above would result in incorrect canonicalization as it would not be 
> normalizing the exprIds relative to the buildQuery output.
> The fix is to use the buildQuery: LogicalPlan's output to normalize the 
> buildKeys expressions, 
> as given below, using the standard approach:
> buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)),
> Will be filing a PR and bug test for the same.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45857) Enforce the error classes in sub-classes of AnalysisException

2023-11-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-45857:
-
Description: Make the error class in sub-classes of AnalysisException 
mandatory to enforce callers to always set it. This simplifies migration on 
error classes.  (was: Make the error class in sub-classes of ParseException 
mandatory to enforce callers to always set it. This simplifies migration on 
error classes.)

> Enforce the error classes in sub-classes of AnalysisException
> -
>
> Key: SPARK-45857
> URL: https://issues.apache.org/jira/browse/SPARK-45857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make the error class in sub-classes of AnalysisException mandatory to enforce 
> callers to always set it. This simplifies migration on error classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45857) Enforce the error classes in sub-classes of AnalysisException

2023-11-09 Thread Max Gekk (Jira)
Max Gekk created SPARK-45857:


 Summary: Enforce the error classes in sub-classes of 
AnalysisException
 Key: SPARK-45857
 URL: https://issues.apache.org/jira/browse/SPARK-45857
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 4.0.0


Make the error class in ParseException mandatory to enforce callers to always 
set it. This simplifies migration on error classes.
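For illustration, a minimal Scala sketch of the pattern with a made-up stand-in exception type (not Spark's actual ParseException or AnalysisException classes): the error class becomes a required constructor argument, so callers cannot omit it.

{code:scala}
// Stand-in type for illustration only.
class IllustrativeParseException(
    val errorClass: String,                      // mandatory, no default value
    val messageParameters: Map[String, String])
  extends Exception(
    s"[$errorClass] " + messageParameters.map { case (k, v) => s"$k=$v" }.mkString(", "))

// Every caller is now forced to name an error class explicitly.
val e = new IllustrativeParseException(
  "PARSE_SYNTAX_ERROR",
  Map("error" -> "end of input", "hint" -> ""))
println(e.getMessage)
{code}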



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45857) Enforce the error classes in sub-classes of AnalysisException

2023-11-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-45857:
-
Description: Make the error class in sub-classes of ParseException 
mandatory to enforce callers to always set it. This simplifies migration on 
error classes.  (was: Make the error class in ParseException mandatory to 
enforce callers to always set it. This simplifies migration on error classes.)

> Enforce the error classes in sub-classes of AnalysisException
> -
>
> Key: SPARK-45857
> URL: https://issues.apache.org/jira/browse/SPARK-45857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Make the error class in sub-classes of ParseException mandatory to enforce 
> callers to always set it. This simplifies migration on error classes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45855) Unable to set compression codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson resolved SPARK-45855.
---
Resolution: Fixed

I found this is fixed in 3.5.0 and I strongly suspect it is caused by the same 
thing documented and fixed in #43504.

> Unable to set compression codec for Hive CTAS
> -
>
> Key: SPARK-45855
> URL: https://issues.apache.org/jira/browse/SPARK-45855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: Spark 3.4.0 
> Stackable.tech release 23.7.0 which runs spark on K8s.
>Reporter: Tim Robertson
>Priority: Major
> Fix For: 3.5.0
>
>
> Hi,
> We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't 
> find anything in the release notes to indicate why, so I wonder if this is a 
> bug. Thank you for looking.
> Here we're using our own custom codec, but we noticed we can't set gzip 
> either.
> {{  SparkConf conf = spark.sparkContext().conf();}}
> {{  conf.set("hive.exec.compress.output", "true");}}
> {{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
> {{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
> This will create the table, but it writes uncompressed files, where Spark 
> 3.3.0 would write compressed files. 
> Any advice is appreciated and I can help run tests. We run Spark on K8S using 
> the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45855) Unable to set compression codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784533#comment-17784533
 ] 

Tim Robertson commented on SPARK-45855:
---

I suspect it is this https://issues.apache.org/jira/browse/SPARK-43504

> Unable to set compression codec for Hive CTAS
> -
>
> Key: SPARK-45855
> URL: https://issues.apache.org/jira/browse/SPARK-45855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: Spark 3.4.0 
> Stackable.tech release 23.7.0 which runs spark on K8s.
>Reporter: Tim Robertson
>Priority: Major
>
> Hi,
> We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't 
> find anything in the release notes to indicate why, so I wonder if this is a 
> bug. Thank you for looking.
> Here we're using our own custom codec, but we noticed we can't set gzip 
> either.
> {{  SparkConf conf = spark.sparkContext().conf();}}
> {{  conf.set("hive.exec.compress.output", "true");}}
> {{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
> {{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
> This will create the table, but it writes uncompressed files, where Spark 
> 3.3.0 would write compressed files. 
> Any advice is appreciated and I can help run tests. We run Spark on K8S using 
> the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45855) Unable to set compression codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated SPARK-45855:
--
Fix Version/s: 3.5.0

> Unable to set compression codec for Hive CTAS
> -
>
> Key: SPARK-45855
> URL: https://issues.apache.org/jira/browse/SPARK-45855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: Spark 3.4.0 
> Stackable.tech release 23.7.0 which runs spark on K8s.
>Reporter: Tim Robertson
>Priority: Major
> Fix For: 3.5.0
>
>
> Hi,
> We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't 
> find anything in the release notes to indicate why, so I wonder if this is a 
> bug. Thank you for looking.
> Here we're using our own custom codec, but we noticed we can't set gzip 
> either.
> {{  SparkConf conf = spark.sparkContext().conf();}}
> {{  conf.set("hive.exec.compress.output", "true");}}
> {{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
> {{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
> This will create the table, but it writes uncompressed files, where Spark 
> 3.3.0 would write compressed files. 
> Any advice is appreciated and I can help run tests. We run Spark on K8S using 
> the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45855) Unable to set compression codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784530#comment-17784530
 ] 

Tim Robertson commented on SPARK-45855:
---

This also seems to fail with 3.4.1 but seems to be fixed in 3.5.0. 

I'm yet to find out why, so I can link it and close this.

> Unable to set compression codec for Hive CTAS
> -
>
> Key: SPARK-45855
> URL: https://issues.apache.org/jira/browse/SPARK-45855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: Spark 3.4.0 
> Stackable.tech release 23.7.0 which runs spark on K8s.
>Reporter: Tim Robertson
>Priority: Major
>
> Hi,
> We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't 
> find anything in the release notes to indicate why, so I wonder if this is a 
> bug. Thank you for looking.
> Here we're using our own custom codec, but we noticed we can't set gzip 
> either.
> {{  SparkConf conf = spark.sparkContext().conf();}}
> {{  conf.set("hive.exec.compress.output", "true");}}
> {{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
> {{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
> This will create the table, but it writes uncompressed files, where Spark 
> 3.3.0 would write compressed files. 
> Any advice is appreciated and I can help run tests. We run Spark on K8S using 
> the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45856) Move ArtifactManager from Spark Connect into SparkSession (sql/core)

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45856:
---
Labels: pull-request-available  (was: )

> Move ArtifactManager from Spark Connect into SparkSession (sql/core)
> 
>
> Key: SPARK-45856
> URL: https://issues.apache.org/jira/browse/SPARK-45856
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, SQL
>Affects Versions: 4.0.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>  Labels: pull-request-available
>
> The `ArtifactManager` that currently lives in the connect package can be moved 
> into the wider sql/core package (e.g. SparkSession) to expand its scope. This 
> is possible because the `ArtifactManager` is tied solely to the 
> `SparkSession#sessionUUID` and hence can be cleanly detached from Spark 
> Connect and be made generally available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45855) Unable to set compression codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated SPARK-45855:
--
Summary: Unable to set compression codec for Hive CTAS  (was: Unable to set 
codec for Hive CTAS)

> Unable to set compression codec for Hive CTAS
> -
>
> Key: SPARK-45855
> URL: https://issues.apache.org/jira/browse/SPARK-45855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: Spark 3.4.0 
> Stackable.tech release 23.7.0 which runs spark on K8s.
>Reporter: Tim Robertson
>Priority: Major
>
> Hi,
> We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't 
> find anything in the release notes to indicate why, so I wonder if this is a 
> bug. Thank you for looking.
> Here we're using our own custom codec, but we noticed we can't set gzip 
> either.
> {{  SparkConf conf = spark.sparkContext().conf();}}
> {{  conf.set("hive.exec.compress.output", "true");}}
> {{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
> {{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
> This will create the table, but it writes uncompressed files, where Spark 
> 3.3.0 would write compressed files. 
> Any advice is appreciated and I can help run tests. We run Spark on K8S using 
> the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45855) Unable to set codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Robertson updated SPARK-45855:
--
Description: 
Hi,

We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't find 
anything in the release notes to indicate why, so I wonder if this is a bug. 
Thank you for looking.

Here we're using our own custom codec, but we noticed we can't set gzip either.

{{  SparkConf conf = spark.sparkContext().conf();}}
{{  conf.set("hive.exec.compress.output", "true");}}
{{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
{{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}

This will create the table, but it writes uncompressed files, where Spark 3.3.0 
would write compressed files. 

Any advice is appreciated and I can help run tests. We run Spark on K8S using 
the stackable.tech distribution.

  was:
Hi,

We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't find 
anything in the release notes to indicate why, so I wonder if this is a bug. 
Thank you for looking.

Here we're using our own custom codec, but we noticed we can't set gzip either.



{{  SparkConf conf = spark.sparkContext().conf();}}
{{  conf.set("hive.exec.compress.output", "true");}}
{{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
{{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
 

Any advice is appreciated and I can help run tests. We run Spark on K8S using 
the stackable.tech distribution.


> Unable to set codec for Hive CTAS
> -
>
> Key: SPARK-45855
> URL: https://issues.apache.org/jira/browse/SPARK-45855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
> Environment: Spark 3.4.0 
> Stackable.tech release 23.7.0 which runs spark on K8s.
>Reporter: Tim Robertson
>Priority: Major
>
> Hi,
> We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't 
> find anything in the release notes to indicate why, so I wonder if this is a 
> bug. Thank you for looking.
> Here we're using our own custom codec, but we noticed we can't set gzip 
> either.
> {{  SparkConf conf = spark.sparkContext().conf();}}
> {{  conf.set("hive.exec.compress.output", "true");}}
> {{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
> {{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
> This will create the table, but it writes uncompressed files, where Spark 
> 3.3.0 would write compressed files. 
> Any advice is appreciated and I can help run tests. We run Spark on K8S using 
> the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45855) Unable to set codec for Hive CTAS

2023-11-09 Thread Tim Robertson (Jira)
Tim Robertson created SPARK-45855:
-

 Summary: Unable to set codec for Hive CTAS
 Key: SPARK-45855
 URL: https://issues.apache.org/jira/browse/SPARK-45855
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
 Environment: Spark 3.4.0 
Stackable.tech release 23.7.0 which runs spark on K8s.
Reporter: Tim Robertson


Hi,

We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't find 
anything in the release notes to indicate why, so I wonder if this is a bug. 
Thank you for looking.

Here we're using our own custom codec, but we noticed we can't set gzip either.



{{  SparkConf conf = spark.sparkContext().conf();}}
{{  conf.set("hive.exec.compress.output", "true");}}
{{  conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }}
{{  spark.sql("CREATE TABLE b AS SELECT id FROM a");}}
 

Any advice is appreciated and I can help run tests. We run Spark on K8S using 
the stackable.tech distribution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45849) Remove unnecessary toSeq when encoding Set to catalyst

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45849:
---
Labels: pull-request-available  (was: )

> Remove unnecessary toSeq when encoding Set to catalyst
> --
>
> Key: SPARK-45849
> URL: https://issues.apache.org/jira/browse/SPARK-45849
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Emil Ejbyfeldt
>Priority: Minor
>  Labels: pull-request-available
>
> Currently when encoding Sets to catalyst we first convert them into a Seq. 
> There is no good reason to do this, as the interface we are targeting for 
> encoding is only `Iterable`, which is implemented by Set. So by using Iterable 
> instead of Seq in some places, we should be able to avoid this extra copy when 
> encoding Sets.
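For illustration, a minimal Scala sketch of the idea (not the actual encoder internals): targeting Iterable lets a Set be traversed in place instead of being copied into a Seq first.

{code:scala}
val s: Set[Int] = Set(1, 2, 3)

// Today: an intermediate Seq copy is allocated before the elements are walked.
val viaSeq: Iterable[Int] = s.toSeq

// Proposed: a Set is already an Iterable, so it can be traversed directly.
val viaIterable: Iterable[Int] = s

assert(viaSeq.toSet == viaIterable.toSet) // same elements, one fewer copy
{code}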



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45854) spark.catalog.listTables fails with ParseException after upgrading to Spark 3.4.1 from 3.3.1

2023-11-09 Thread Andrej Zachar (Jira)
Andrej Zachar created SPARK-45854:
-

 Summary: spark.catalog.listTables fails with ParseException after 
upgrading to Spark 3.4.1 from 3.3.1
 Key: SPARK-45854
 URL: https://issues.apache.org/jira/browse/SPARK-45854
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, Spark Submit
Affects Versions: 3.4.1, 3.4.0
Reporter: Andrej Zachar


After upgrading to Spark 3.4.1, the listTables() method in PySpark now throws a 
ParseException with the message "Syntax error at or near end of input.". This 
did not occur in previous versions of Spark, such as 3.3.1.

Install Spark version 3.4.1.

Run pyspark:
{noformat}
pyspark --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
{noformat}

Attempt to list tables:
{code:python}
spark.range(1).createTempView("test_view")
spark.catalog.listTables()
{code}

Expected result: the listTables() method should return a list of tables without
throwing any exceptions.

Actual result:
{noformat}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".venv/lib/python3.10/site-packages/pyspark/sql/catalog.py", line 302, in listTables
    iter = self._jcatalog.listTables(dbName).toLocalIterator()
  File ".venv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File ".venv/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near end of input.(line 1, pos 0)

== SQL ==

^^^
{noformat}

The same code worked correctly in Spark version 3.3.1.
No changes were made to the code aside from upgrading Spark.

Thank you for considering this issue! Any assistance in resolving it would be 
greatly appreciated.

Best regards,
Andrej



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45853) Add Iceberg and Hudi to third party projects

2023-11-09 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-45853:
---

 Summary: Add Iceberg and Hudi to third party projects
 Key: SPARK-45853
 URL: https://issues.apache.org/jira/browse/SPARK-45853
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Yuming Wang



{noformat}
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
java.util.concurrent.ExecutionException: 
org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to 
find the data source: iceberg. Please find packages at 
`https://spark.apache.org/third-party-projects.html`.
at 
org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
at 
org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
{noformat}
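For context, a sketch (illustrative only, not from the original report) of how a user would
typically make the `iceberg` data source resolvable; the artifact coordinates and versions
below are assumptions and must match the Spark/Scala build in use:

{code:scala}
import org.apache.spark.sql.SparkSession

// Pull the Iceberg Spark runtime onto the classpath at session start so the data source
// lookup for "iceberg" succeeds instead of failing with DATA_SOURCE_NOT_FOUND.
val spark = SparkSession.builder()
  .appName("iceberg-read")
  .config("spark.jars.packages",
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2") // adjust versions as needed
  .getOrCreate()

// With the runtime present, path-based reads of an Iceberg table work.
val df = spark.read.format("iceberg").load("/path/to/iceberg/table")
{code}

The documentation change proposed here would point users to such packages from the
third-party projects page.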




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45850:
---
Labels: pull-request-available  (was: )

> Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc 
> driver version 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45852) Gracefully deal with recursion exception during Spark Connect logging

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45852:
---
Labels: pull-request-available  (was: )

> Gracefully deal with recursion exception during Spark Connect logging
> -
>
> Key: SPARK-45852
> URL: https://issues.apache.org/jira/browse/SPARK-45852
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Martin Grund
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> from google.protobuf.text_format import MessageToString
> from pyspark.sql.functions import col, lit
>
> df = spark.range(10)
>
> for x in range(800):
>   df = df.withColumn(f"next{x}", lit(1))
>   MessageToString(df._plan.to_proto(spark._client), as_one_line=True)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45852) Gracefully deal with recursion exception during Spark Connect logging

2023-11-09 Thread Martin Grund (Jira)
Martin Grund created SPARK-45852:


 Summary: Gracefully deal with recursion exception during Spark 
Connect logging
 Key: SPARK-45852
 URL: https://issues.apache.org/jira/browse/SPARK-45852
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Martin Grund


{code:python}
from google.protobuf.text_format import MessageToString
from pyspark.sql.functions import col, lit

df = spark.range(10)

# Repeatedly widening the plan builds a deeply nested proto; rendering it to text
# can then fail with a recursion error, which the Connect client logging should
# handle gracefully.
for x in range(800):
  df = df.withColumn(f"next{x}", lit(1))
  MessageToString(df._plan.to_proto(spark._client), as_one_line=True)
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns

2023-11-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45815:
---

Assignee: Yaohua Zhao

> Provide an interface for Streaming sources to add _metadata columns
> ---
>
> Key: SPARK-45815
> URL: https://issues.apache.org/jira/browse/SPARK-45815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
>  Labels: pull-request-available
>
> Currently, only the native V1 file-based streaming source can read the 
> `_metadata` column: 
> [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63]
>  
> Our goal is to create an interface that allows other streaming sources to add
> `_metadata` columns. For instance, we would like the Delta Streaming
> source, which you can find here:
> [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
> to extend this interface and provide the `_metadata` column for its
> underlying storage format, such as Parquet.
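> As a point of reference, a minimal sketch (illustrative only) of how the `_metadata`
> column is consumed today from the native file-based streaming source; the goal of this
> ticket is to let other sources offer the same column:
> {code:scala}
> // Assumes a spark-shell style session with `spark` in scope and a file-based source,
> // which already exposes the hidden _metadata column.
> import spark.implicits._
>
> val stream = spark.readStream
>   .format("json")
>   .schema("id LONG")
>   .load("/tmp/input")
>   .select($"id", $"_metadata.file_path", $"_metadata.file_modification_time")
> {code}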



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns

2023-11-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45815.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43692
[https://github.com/apache/spark/pull/43692]

> Provide an interface for Streaming sources to add _metadata columns
> ---
>
> Key: SPARK-45815
> URL: https://issues.apache.org/jira/browse/SPARK-45815
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Yaohua Zhao
>Assignee: Yaohua Zhao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, only the native V1 file-based streaming source can read the 
> `_metadata` column: 
> [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63]
>  
> Our goal is to create an interface that allows other streaming sources to add
> `_metadata` columns. For instance, we would like the Delta Streaming
> source, which you can find here:
> [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49],
> to extend this interface and provide the `_metadata` column for its
> underlying storage format, such as Parquet.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44886) Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE

2023-11-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-44886.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42577
[https://github.com/apache/spark/pull/42577]

> Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE
> ---
>
> Key: SPARK-44886
> URL: https://issues.apache.org/jira/browse/SPARK-44886
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This proposes to introduce a CLUSTER BY clause to the CREATE/REPLACE TABLE SQL syntax:
> {code:sql}
> CREATE TABLE tbl(a int, b string) CLUSTER BY (a, b){code}
> This doesn't introduce a default implementation for clustering, but it's up 
> to the catalog/datasource implementation to utilize the clustering 
> information (e.g., Delta, Iceberg, etc.).
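> A second example (added for illustration), showing the same clause on the REPLACE TABLE
> side of the syntax, issued through spark.sql; as with CREATE TABLE, the catalog/datasource
> implementation decides what to do with the clustering columns:
> {code:scala}
> // Illustrative only; assumes `tbl` already exists in a catalog that supports REPLACE TABLE.
> spark.sql("REPLACE TABLE tbl (a INT, b STRING) CLUSTER BY (a, b)")
> {code}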



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-44886) Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE

2023-11-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-44886:
---

Assignee: Terry Kim

> Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE
> ---
>
> Key: SPARK-44886
> URL: https://issues.apache.org/jira/browse/SPARK-44886
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>  Labels: pull-request-available
>
> This proposes to introduce a CLUSTER BY clause to the CREATE/REPLACE TABLE SQL syntax:
> {code:sql}
> CREATE TABLE tbl(a int, b string) CLUSTER BY (a, b){code}
> This doesn't introduce a default implementation for clustering, but it's up 
> to the catalog/datasource implementation to utilize the clustering 
> information (e.g., Delta, Iceberg, etc.).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45851) (Scala) Support different retry policies for connect client

2023-11-09 Thread Alice Sayutina (Jira)
Alice Sayutina created SPARK-45851:
--

 Summary: (Scala) Support different retry policies for connect 
client
 Key: SPARK-45851
 URL: https://issues.apache.org/jira/browse/SPARK-45851
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 4.0.0
Reporter: Alice Sayutina


Support multiple retry policies defined at the same time. Each policy
determines which error types it can retry and exactly how.

For instance, networking errors should generally be retried differently than
errors indicating that a remote resource is unavailable.

Relevant python ticket: SPARK-45733
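A conceptual sketch of the idea (purely illustrative; these names and signatures are
assumptions, not the actual Spark Connect client API): each policy states which errors it
owns and how it backs off, and the client asks the policies in order.

{code:scala}
import scala.concurrent.duration._

// Hypothetical shapes, named here only for illustration.
trait RetryPolicy {
  def canRetry(error: Throwable): Boolean     // which error types this policy handles
  def nextWait(attempt: Int): FiniteDuration  // how it backs off
}

object TransientNetworkPolicy extends RetryPolicy {
  def canRetry(error: Throwable): Boolean = error.isInstanceOf[java.io.IOException]
  def nextWait(attempt: Int): FiniteDuration = (50 * (1 << attempt)).millis // exponential
}

object ResourceUnavailablePolicy extends RetryPolicy {
  def canRetry(error: Throwable): Boolean =
    error.getMessage != null && error.getMessage.contains("UNAVAILABLE")
  def nextWait(attempt: Int): FiniteDuration = 2.seconds // slower, fixed backoff
}

// The client would consult the policies in order and let the first matching one decide.
def policyFor(error: Throwable, policies: Seq[RetryPolicy]): Option[RetryPolicy] =
  policies.find(_.canRetry(error))
{code}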




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45733) (Python) Support different retry policies for connect client

2023-11-09 Thread Alice Sayutina (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alice Sayutina updated SPARK-45733:
---
Summary: (Python) Support different retry policies for connect client  
(was: Classify errors into different classes and support different retry 
policies.)

> (Python) Support different retry policies for connect client
> 
>
> Key: SPARK-45733
> URL: https://issues.apache.org/jira/browse/SPARK-45733
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Alice Sayutina
>Priority: Major
>  Labels: pull-request-available
>
> Support multiple retry policies defined at the same time. Each policy
> determines which error types it can retry and exactly how.
> For instance, networking errors should generally be retried differently than
> errors indicating that a remote resource is unavailable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version

2023-11-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784335#comment-17784335
 ] 

ASF GitHub Bot commented on SPARK-45850:


User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/43662

> Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc 
> driver version 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45850:
--

Assignee: Apache Spark

> Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc 
> driver version 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version

2023-11-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45850:
--

Assignee: (was: Apache Spark)

> Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc 
> driver version 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45282) Join loses records for cached datasets

2023-11-09 Thread Emil Ejbyfeldt (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784320#comment-17784320
 ] 

Emil Ejbyfeldt commented on SPARK-45282:


Created [https://github.com/apache/spark/pull/43729] to backport the fix to 3.4;
from my manual test it solves the reproduction in this ticket.

> Join loses records for cached datasets
> --
>
> Key: SPARK-45282
> URL: https://issues.apache.org/jira/browse/SPARK-45282
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
> Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or 
> databricks 13.3
>Reporter: koert kuipers
>Priority: Blocker
>  Labels: CorrectnessBug, correctness, pull-request-available
>
> We observed this issue on Spark 3.4.1, and it is also present on 3.5.0. It is
> not present on Spark 3.3.1.
> It only shows up in a distributed environment; I cannot replicate it in a unit test.
> However, I did get it to show up on a Hadoop cluster, on Kubernetes, and on
> Databricks 13.3.
> The issue is that records are dropped when two cached dataframes are joined.
> It seems that in Spark 3.4.1 some Exchanges are dropped from the query plan as an
> optimization, while in Spark 3.3.1 these Exchanges are still present. It appears
> to be an issue with AQE when canChangeCachedPlanOutputPartitioning=true.
> To reproduce on a distributed cluster, these settings are needed:
> {code:java}
> spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432
> spark.sql.adaptive.coalescePartitions.parallelismFirst false
> spark.sql.adaptive.enabled true
> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code}
> Code using Scala to reproduce:
> {code:java}
> import java.util.UUID
> import org.apache.spark.sql.functions.col
> import spark.implicits._
> val data = (1 to 100).toDS().map(i => 
> UUID.randomUUID().toString).persist()
> val left = data.map(k => (k, 1))
> val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works!
> println("number of left " + left.count())
> println("number of right " + right.count())
> println("number of (left join right) " +
>   left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count()
> )
> val left1 = left
>   .toDF("key", "value1")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of left1 " + left1.count())
> val right1 = right
>   .toDF("key", "value2")
>   .repartition(col("key")) // comment out this line to make it work
>   .persist()
> println("number of right1 " + right1.count())
> println("number of (left1 join right1) " +  left1.join(right1, 
> "key").count()) // this gives incorrect result{code}
> this produces the following output:
> {code:java}
> number of left 100
> number of right 100
> number of (left join right) 100
> number of left1 100
> number of right1 100
> number of (left1 join right1) 859531 {code}
> Note that the last number (the incorrect one) actually varies depending on
> settings, cluster size, etc.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


