[jira] [Updated] (SPARK-46954) XML: Perf optimizations

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46954:
---
Labels: pull-request-available  (was: )

> XML: Perf optimizations
> ---
>
> Key: SPARK-46954
> URL: https://issues.apache.org/jira/browse/SPARK-46954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46957) Migrated shuffle data files from the decommissioned node should be removed when job completed

2024-02-02 Thread Yu-Jhe Li (Jira)
Yu-Jhe Li created SPARK-46957:
-

 Summary: Migrated shuffle data files from the decommissioned node 
should be removed when job completed
 Key: SPARK-46957
 URL: https://issues.apache.org/jira/browse/SPARK-46957
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yu-Jhe Li


Hi, we have a long-lived Spark application running on a standalone cluster on GCP, 
and we are using spot instances. To reduce the impact of preempted instances, 
we have enabled node decommissioning so that a preempted node migrates its shuffle 
data to other instances before GCP deletes it.

However, we found that the migrated shuffle data from the decommissioned node is 
never removed (same behavior on Spark 3.5).

*Steps to reproduce:*

1. Start spark-shell with 3 executors and enable decommission on both 
driver/worker

 
{code:java}
start-worker.sh[3331]: Spark Command: 
/usr/lib/jvm/java-17-openjdk-amd64/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Dspark.worker.cleanup.appDataTtl=1800 
-Dspark.decommission.enabled=true -Xmx1g org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://master-01.com:7077 {code}
 

 
{code:java}
/opt/spark/bin/spark-shell --master spark://master-01.spark.com:7077 \
  --total-executor-cores 12 \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true{code}
 

2. Manually stop 1 worker during execution

 
{code:java}
(1 to 10).foreach { i =>
  println(s"start iter $i ...")
  val longString = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Integer eget tortor id libero ultricies faucibus nec ac neque. Vivamus ac risus 
vitae mi efficitur lacinia. Quisque dignissim quam vel tellus placerat, non 
laoreet elit rhoncus. Nam et magna id dui tempor sagittis. Aliquam erat 
volutpat. Integer tristique purus ac eros bibendum, at varius velit viverra. 
Sed eleifend luctus massa, ac accumsan leo feugiat ac. Sed id nisl et enim 
tristique auctor. Sed vel ante nec leo placerat tincidunt. Ut varius, risus nec 
sodales tempor, odio augue euismod ipsum, nec tristique e"
  val df = (1 to 1 * i).map(j => (j, s"${j}_${longString}")).toDF("id", 
"mystr")

  df.repartition(6).count()
  System.gc()
  println(s"finished iter $i, wait 15s for next round")
  Thread.sleep(15*1000)
}
System.gc()

start iter 1 ...
finished iter 1, wait 15s for next round
... {code}
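(Aside: on a live executor, shuffle files are normally removed when the driver 
garbage-collects the corresponding shuffle dependency and ContextCleaner asks the 
executors to drop it, which is presumably why the script above calls System.gc(). 
If useful, the driver's periodic GC can also be made more frequent; an illustrative 
variant of the spark-shell command from step 1 with spark.cleaner.periodicGC.interval 
added:)
{code:java}
/opt/spark/bin/spark-shell --master spark://master-01.spark.com:7077 \
  --total-executor-cores 12 \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true \
  --conf spark.cleaner.periodicGC.interval=5min {code}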
 

3. Check the migrated shuffle data files on the remaining workers

{*}decommissioned node{*}: migrated shuffle file successfully
{code:java}
less /mnt/spark_work/app-20240202084807-0003/1/stdout | grep 'Migrated '
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_41 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_38 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_47 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_44 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_52 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_55 to BlockManagerId(2, 10.67.5.139, 35949, None) {code}
{*}remaining shuffle data files on the other workers{*}: the migrated shuffle 
files are never removed

 
{code:java}
10.67.5.134 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/13/shuffle_4_47_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/31/shuffle_4_38_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/3a/shuffle_5_52_0.data
10.67.5.139 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/27/shuffle_4_41_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/36/shuffle_4_44_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cd

[jira] [Updated] (SPARK-46957) Migrated shuffle data files from the decommissioned node should be removed when job completed

2024-02-02 Thread Yu-Jhe Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Jhe Li updated SPARK-46957:
--
Description: 
Hi, we have a long-lived Spark application running on a standalone cluster on GCP, 
and we are using spot instances. To reduce the impact of preempted instances, 
we have enabled node decommissioning so that a preempted node migrates its shuffle 
data to other instances before GCP deletes it.

However, we found that the migrated shuffle data from the decommissioned node is 
never removed (same behavior on Spark 3.5).

*Steps to reproduce:*

1. Start spark-shell with 3 executors and enable decommission on both 
driver/worker
{code:java}
start-worker.sh[3331]: Spark Command: 
/usr/lib/jvm/java-17-openjdk-amd64/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Dspark.worker.cleanup.appDataTtl=1800 
-Dspark.decommission.enabled=true -Xmx1g org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://master-01.com:7077 {code}
 

 
{code:java}
/opt/spark/bin/spark-shell --master spark://master-01.spark.com:7077 \
  --total-executor-cores 12 \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true{code}
 

2. Manually stop 1 worker during execution

 
{code:java}
(1 to 10).foreach { i =>
  println(s"start iter $i ...")
  val longString = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Integer eget tortor id libero ultricies faucibus nec ac neque. Vivamus ac risus 
vitae mi efficitur lacinia. Quisque dignissim quam vel tellus placerat, non 
laoreet elit rhoncus. Nam et magna id dui tempor sagittis. Aliquam erat 
volutpat. Integer tristique purus ac eros bibendum, at varius velit viverra. 
Sed eleifend luctus massa, ac accumsan leo feugiat ac. Sed id nisl et enim 
tristique auctor. Sed vel ante nec leo placerat tincidunt. Ut varius, risus nec 
sodales tempor, odio augue euismod ipsum, nec tristique e"
  val df = (1 to 1 * i).map(j => (j, s"${j}_${longString}")).toDF("id", 
"mystr")

  df.repartition(6).count()
  System.gc()
  println(s"finished iter $i, wait 15s for next round")
  Thread.sleep(15*1000)
}
System.gc()

start iter 1 ...
finished iter 1, wait 15s for next round
... {code}
 

3. Check the migrated shuffle data files on the remaining workers

{*}decommissioned node{*}: migrated shuffle file successfully
{code:java}
less /mnt/spark_work/app-20240202084807-0003/1/stdout | grep 'Migrated '
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_41 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_38 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_47 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_44 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_52 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_55 to BlockManagerId(2, 10.67.5.139, 35949, None) {code}
{*}remaining shuffle data files on the other workers{*}: the migrated shuffle 
files are never removed

 
{code:java}
10.67.5.134 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/13/shuffle_4_47_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/31/shuffle_4_38_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/3a/shuffle_5_52_0.data
10.67.5.139 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/27/shuffle_4_41_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/36/shuffle_4_44_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spar

[jira] [Updated] (SPARK-46957) Migrated shuffle data files from the decommissioned node should be removed when job completed

2024-02-02 Thread Yu-Jhe Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Jhe Li updated SPARK-46957:
--
Description: 
Hi, we have a long-lived Spark application running on a standalone cluster on GCP, 
and we are using spot instances. To reduce the impact of preempted instances, 
we have enabled node decommissioning so that a preempted node migrates its shuffle 
data to other instances before GCP deletes it.

However, we found that the migrated shuffle data from the decommissioned node is 
never removed (same behavior on Spark 3.5).

*Steps to reproduce:*

1. Start spark-shell with 3 executors and enable decommission on both 
driver/worker
{code:java}
start-worker.sh[3331]: Spark Command: 
/usr/lib/jvm/java-17-openjdk-amd64/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Dspark.worker.cleanup.appDataTtl=1800 
-Dspark.decommission.enabled=true -Xmx1g org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://master-01.com:7077 {code}
{code:java}
/opt/spark/bin/spark-shell --master spark://master-01.spark.com:7077 \
  --total-executor-cores 12 \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true{code}
 

2. Manually stop 1 worker during execution
{code:java}
(1 to 10).foreach { i =>
  println(s"start iter $i ...")
  val longString = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Integer eget tortor id libero ultricies faucibus nec ac neque. Vivamus ac risus 
vitae mi efficitur lacinia. Quisque dignissim quam vel tellus placerat, non 
laoreet elit rhoncus. Nam et magna id dui tempor sagittis. Aliquam erat 
volutpat. Integer tristique purus ac eros bibendum, at varius velit viverra. 
Sed eleifend luctus massa, ac accumsan leo feugiat ac. Sed id nisl et enim 
tristique auctor. Sed vel ante nec leo placerat tincidunt. Ut varius, risus nec 
sodales tempor, odio augue euismod ipsum, nec tristique e"
  val df = (1 to 1 * i).map(j => (j, s"${j}_${longString}")).toDF("id", 
"mystr")

  df.repartition(6).count()
  System.gc()
  println(s"finished iter $i, wait 15s for next round")
  Thread.sleep(15*1000)
}
System.gc()

start iter 1 ...
finished iter 1, wait 15s for next round
... {code}
 

3. Check the migrated shuffle data files on the remaining workers

{*}decommissioned node{*}: migrated shuffle file successfully
{code:java}
less /mnt/spark_work/app-20240202084807-0003/1/stdout | grep 'Migrated '
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_41 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_38 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_47 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_44 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_52 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_55 to BlockManagerId(2, 10.67.5.139, 35949, None) {code}
{*}remaining shuffle data files on the other workers{*}: the migrated shuffle 
files are never removed
{code:java}
10.67.5.134 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/13/shuffle_4_47_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/31/shuffle_4_38_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/3a/shuffle_5_52_0.data
10.67.5.139 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/27/shuffle_4_41_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/36/shuffle_4_44_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/29/shuffle_5_55_0.data
 {code}
 

 

  was:
Hi, we have a long-lived Spark application run on a standalone cluster on GCP 
and we are using spot instances. To reduce the impact of preempted 

[jira] [Updated] (SPARK-46957) Migrated shuffle data files from the decommissioned node should be removed when job completed

2024-02-02 Thread Yu-Jhe Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Jhe Li updated SPARK-46957:
--
Description: 
Hi, we have a long-lived Spark application running on a standalone cluster on GCP, 
and we are using spot instances. To reduce the impact of preempted instances, 
we have enabled node decommissioning so that a preempted node migrates its shuffle 
data to other instances before GCP deletes it.

However, we found that the migrated shuffle data from the decommissioned node is 
never removed (same behavior on Spark 3.5).

*Steps to reproduce:*

1. Start spark-shell with 3 executors and enable decommission on both 
driver/worker
{code:java}
start-worker.sh[3331]: Spark Command: 
/usr/lib/jvm/java-17-openjdk-amd64/bin/java -cp 
/opt/spark/conf/:/opt/spark/jars/* -Dspark.worker.cleanup.appDataTtl=1800 
-Dspark.decommission.enabled=true -Xmx1g org.apache.spark.deploy.worker.Worker 
--webui-port 8081 spark://master-01.com:7077 {code}
{code:java}
/opt/spark/bin/spark-shell --master spark://master-01.spark.com:7077 \
  --total-executor-cores 12 \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true{code}
 

2. Manually stop 1 worker during execution
{code:java}
(1 to 10).foreach { i =>
  println(s"start iter $i ...")
  val longString = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Integer eget tortor id libero ultricies faucibus nec ac neque. Vivamus ac risus 
vitae mi efficitur lacinia. Quisque dignissim quam vel tellus placerat, non 
laoreet elit rhoncus. Nam et magna id dui tempor sagittis. Aliquam erat 
volutpat. Integer tristique purus ac eros bibendum, at varius velit viverra. 
Sed eleifend luctus massa, ac accumsan leo feugiat ac. Sed id nisl et enim 
tristique auctor. Sed vel ante nec leo placerat tincidunt. Ut varius, risus nec 
sodales tempor, odio augue euismod ipsum, nec tristique e"
  val df = (1 to 1 * i).map(j => (j, s"${j}_${longString}")).toDF("id", 
"mystr")

  df.repartition(6).count()
  System.gc()
  println(s"finished iter $i, wait 15s for next round")
  Thread.sleep(15*1000)
}
System.gc()

start iter 1 ...
finished iter 1, wait 15s for next round
... {code}
 

3. Check the migrated shuffle data files on the remaining workers

{*}decommissioned node{*}: migrated shuffle file successfully
{code:java}
less /mnt/spark_work/app-20240202084807-0003/1/stdout | grep 'Migrated '
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_41 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_38 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_47 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_4_44 to BlockManagerId(2, 10.67.5.139, 35949, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_52 to BlockManagerId(0, 10.67.5.134, 36175, None)
24/02/02 08:48:53 INFO BlockManagerDecommissioner: Migrated 
migrate_shuffle_5_55 to BlockManagerId(2, 10.67.5.139, 35949, None) {code}
{*}remaining shuffle data files on the other workers{*}: the migrated shuffle 
files are never removed
{code:java}
10.67.5.134 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/13/shuffle_4_47_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/31/shuffle_4_38_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-b25878b3-8b3c-4cff-ba4d-41f6d128da7c/executor-b8f83524-9270-4f35-83ca-ceb13af2b7d1/blockmgr-f05c4d8e-e1a5-4822-a6e9-49be760b67a2/3a/shuffle_5_52_0.data
10.67.5.139 | CHANGED | rc=0 >>
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/27/shuffle_4_41_0.data
-rw-r--r-- 1 spark spark 126 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/36/shuffle_4_44_0.data
-rw-r--r-- 1 spark spark 32 Feb  2 08:48 
/mnt/spark/spark-ab501bec-ddd2-4b82-af3e-f2731066e580/executor-1ca5ad78-1d75-453d-88ab-487d7cdfacb7/blockmgr-f09eb18d-b0e4-48f9-a4ed-5587cef25a16/29/shuffle_5_55_0.data
 {code}
 

*Expected behavior:*

The migrated shuffle data files should be removed after the job completes.
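Until that happens, one blunt workaround is to periodically sweep each worker's 
local Spark directories for orphaned shuffle files once the owning application has 
finished. A minimal sketch (hypothetical helper, not part of Spark; the filename 
pattern and directory layout follow the listings above, and it must only be run 
when no application still needs those files):
{code:java}
import java.nio.file.{Files, Path, Paths}

// Hypothetical cleanup helper: walk a worker's local dir (e.g. /mnt/spark) and
// delete shuffle_<shuffleId>_<mapId>_<reduceId>.data/.index files older than a
// TTL, mirroring spark.worker.cleanup.appDataTtl from the worker command above.
def sweepShuffleFiles(localDir: String, ttlMillis: Long): Unit = {
  val cutoff = System.currentTimeMillis() - ttlMillis
  val stream = Files.walk(Paths.get(localDir))
  try {
    stream
      .filter((p: Path) => Files.isRegularFile(p))
      .filter((p: Path) =>
        p.getFileName.toString.matches("shuffle_\\d+_\\d+_\\d+\\.(data|index)"))
      .filter((p: Path) => Files.getLastModifiedTime(p).toMillis < cutoff)
      .forEach((p: Path) => { Files.deleteIfExists(p); () })
  } finally stream.close()
}

// Example: remove migrated shuffle files older than 30 minutes.
sweepShuffleFiles("/mnt/spark", 30 * 60 * 1000L) {code}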

  was:
Hi, we have a long-lived Spark application run on a stan

[jira] [Commented] (SPARK-20624) SPIP: Add better handling for node shutdown

2024-02-02 Thread Yu-Jhe Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813592#comment-17813592
 ] 

Yu-Jhe Li commented on SPARK-20624:
---

Hi, we found that the migrated shuffle files from the decommissioned node are never 
deleted, even long after the job has completed.

I have created https://issues.apache.org/jira/browse/SPARK-46957 to 
track this issue. Can anyone help?

> SPIP: Add better handling for node shutdown
> ---
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE preemptible instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46949) Support CHAR/VARCHAR through ResolveDefaultColumns

2024-02-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-46949.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44991
[https://github.com/apache/spark/pull/44991]

> Support CHAR/VARCHAR through  ResolveDefaultColumns
> ---
>
> Key: SPARK-46949
> URL: https://issues.apache.org/jira/browse/SPARK-46949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46949) Support CHAR/VARCHAR through ResolveDefaultColumns

2024-02-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-46949:


Assignee: Kent Yao

> Support CHAR/VARCHAR through  ResolveDefaultColumns
> ---
>
> Key: SPARK-46949
> URL: https://issues.apache.org/jira/browse/SPARK-46949
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46958) FIx coerceDefaultValue when canUpCast

2024-02-02 Thread Kent Yao (Jira)
Kent Yao created SPARK-46958:


 Summary: FIx coerceDefaultValue when canUpCast
 Key: SPARK-46958
 URL: https://issues.apache.org/jira/browse/SPARK-46958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao


```
create table src(key int, c string DEFAULT date'2018-11-17') using parquet;
Time taken: 0.133 seconds
spark-sql (default)> desc src;
[INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
You hit a bug in Spark or the Spark plugins you use. Please, report this bug to 
the corresponding communities or vendors, and provide the full stack trace.
org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase analysis 
failed with an internal error. You hit a bug in Spark or the Spark plugins you 
use. Please, report this bug to the corresponding communities or vendors, and 
provide the full stack trace.
```
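For context, resolving a column default roughly needs to do the following: when the 
default expression's type differs from the column type, either coerce it (when an 
up-cast is allowed) or fail analysis with a clear error, rather than letting an 
internal error surface later as in the {{desc src}} repro above. A rough illustrative 
sketch, not the actual ResolveDefaultColumns code (it assumes {{Cast.canUpCast(from, to)}} 
with this two-argument shape):
{code:java}
import org.apache.spark.sql.catalyst.expressions.{Cast, Expression}
import org.apache.spark.sql.types.DataType

// Illustrative only: coerce a default-value expression to the column type,
// or raise a clear error when no implicit up-cast exists.
def coerceDefault(default: Expression, target: DataType): Expression = {
  if (default.dataType == target) {
    default
  } else if (Cast.canUpCast(default.dataType, target)) {
    Cast(default, target)
  } else {
    throw new IllegalArgumentException(
      s"Default value of type ${default.dataType} cannot be coerced to $target")
  }
} {code}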



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46958) FIx coerceDefaultValue when canUpCast

2024-02-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-46958:
-
Affects Version/s: 3.5.0

> FIx coerceDefaultValue when canUpCast
> -
>
> Key: SPARK-46958
> URL: https://issues.apache.org/jira/browse/SPARK-46958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Kent Yao
>Priority: Major
>
> ```
> create table src(key int, c string DEFAULT date'2018-11-17') using parquet;
> Time taken: 0.133 seconds
> spark-sql (default)> desc src;
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> org.apache.spark.SparkException: [INTERNAL_ERROR] The Spark SQL phase 
> analysis failed with an internal error. You hit a bug in Spark or the Spark 
> plugins you use. Please, report this bug to the corresponding communities or 
> vendors, and provide the full stack trace.
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46946) Supporting broadcast of multiple filtering keys in DynamicPruning

2024-02-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-46946:
---

Assignee: Thang Long Vu

> Supporting broadcast of multiple filtering keys in DynamicPruning
> -
>
> Key: SPARK-46946
> URL: https://issues.apache.org/jira/browse/SPARK-46946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Thang Long Vu
>Assignee: Thang Long Vu
>Priority: Major
>  Labels: pull-request-available, releasenotes
>
> This PR extends `DynamicPruningSubquery` to support broadcasting multiple 
> filtering keys (instead of one as before). Most of the PR simply generalises 
> the single-key case to multiple keys.
> Note: we do not actually use a multi-key `DynamicPruningSubquery` in this PR; 
> we are doing this to make it easier to support DPP for Null Safe Equality or 
> for multiple equality predicates in the future.
> In a Null Safe Equality JOIN, the JOIN condition `a <=> b` is transformed to 
> `Coalesce(key1, Literal(key1.dataType)) = Coalesce(key2, 
> Literal(key2.dataType)) AND IsNull(key1) = IsNull(key2)`. To get the highest 
> pruning efficiency, we broadcast the two keys `Coalesce(key, 
> Literal(key.dataType))` and `IsNull(key)` and use both to prune the other 
> side at the same time.
> Before, `DynamicPruningSubquery` had only one broadcast key and we only 
> supported DPP for a single `EqualTo` JOIN predicate; now we are extending the 
> subquery to multiple broadcast keys. Please note that DPP is still not 
> supported for multiple JOIN predicates.
> Put another way, at the moment we do not insert a DPP filter for multiple 
> JOIN predicates at the same time; we only potentially insert a DPP filter for 
> a given equality JOIN predicate.
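For readers less familiar with null-safe equality, the rewrite mentioned above can 
be checked directly in spark-shell. A small illustrative snippet (integer keys, so 
the type's default literal is 0); this is not the DPP code itself:
{code:java}
// spark-shell sketch: `<=>` and the expanded Coalesce/IsNull form agree,
// which is why both derived keys have to be broadcast to prune correctly.
import spark.implicits._  // already in scope in spark-shell

val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(1)), (None, Some(2)), (None, None)).toDF("key1", "key2")

df.selectExpr(
  "key1 <=> key2 AS null_safe_eq",
  "(coalesce(key1, 0) = coalesce(key2, 0)) AND (isnull(key1) = isnull(key2)) AS expanded"
).show() {code}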



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46946) Supporting broadcast of multiple filtering keys in DynamicPruning

2024-02-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-46946.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44988
[https://github.com/apache/spark/pull/44988]

> Supporting broadcast of multiple filtering keys in DynamicPruning
> -
>
> Key: SPARK-46946
> URL: https://issues.apache.org/jira/browse/SPARK-46946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Thang Long Vu
>Assignee: Thang Long Vu
>Priority: Major
>  Labels: pull-request-available, releasenotes
> Fix For: 4.0.0
>
>
> This PR extends `DynamicPruningSubquery` to support broadcasting multiple 
> filtering keys (instead of one as before). Most of the PR simply generalises 
> the single-key case to multiple keys.
> Note: we do not actually use a multi-key `DynamicPruningSubquery` in this PR; 
> we are doing this to make it easier to support DPP for Null Safe Equality or 
> for multiple equality predicates in the future.
> In a Null Safe Equality JOIN, the JOIN condition `a <=> b` is transformed to 
> `Coalesce(key1, Literal(key1.dataType)) = Coalesce(key2, 
> Literal(key2.dataType)) AND IsNull(key1) = IsNull(key2)`. To get the highest 
> pruning efficiency, we broadcast the two keys `Coalesce(key, 
> Literal(key.dataType))` and `IsNull(key)` and use both to prune the other 
> side at the same time.
> Before, `DynamicPruningSubquery` had only one broadcast key and we only 
> supported DPP for a single `EqualTo` JOIN predicate; now we are extending the 
> subquery to multiple broadcast keys. Please note that DPP is still not 
> supported for multiple JOIN predicates.
> Put another way, at the moment we do not insert a DPP filter for multiple 
> JOIN predicates at the same time; we only potentially insert a DPP filter for 
> a given equality JOIN predicate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46959) CSV reader reads data inconsistently depending on column position

2024-02-02 Thread Martin Rueckl (Jira)
Martin Rueckl created SPARK-46959:
-

 Summary: CSV reader reads data inconsistently depending on column 
position
 Key: SPARK-46959
 URL: https://issues.apache.org/jira/browse/SPARK-46959
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.4.1
Reporter: Martin Rueckl


Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the following inconsistent dataframe
!image-2024-02-02-13-05-26-203.png|width=352,height=120!

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to cause this 
bug.
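For anyone who wants to reproduce this without an external file, here is a 
self-contained spark-shell sketch built from the sample above (illustrative only; 
the second read uses the default backslash escape for comparison):
{code:java}
// Write the sample rows to a temp CSV and read it back twice: once with the
// empty "escape" option from this report, once with the default escape ("\").
val lines = Seq(
  "\"a\";\"b\";\"c\";\"d\"",
  "10;100,00;\"Some;String\";\"ok\"",
  "20;200,00;\"\";\"still ok\"",
  "30;300,00;\"also ok\";\"\"",
  "40;400,00;\"\";\"\"")
val path = java.nio.file.Files.createTempFile("spark-46959-", ".csv")
java.nio.file.Files.write(path, lines.mkString("\r\n").getBytes("ISO-8859-1"))

def readCsv(escape: String) = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("encoding", "ISO-8859-1")
  .option("lineSep", "\r\n")
  .option("nullValue", "")
  .option("quote", "\"")
  .option("escape", escape)
  .csv(path.toString)

readCsv("").show()    // column d ends up with stray '"' values, per this report
readCsv("\\").show()  // with the default escape, column d should read as expected {code}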



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46959) CSV reader reads data inconsistently depending on column position

2024-02-02 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-46959:
--
Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the following inconsistent dataframe

 

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to cause this 
bug.

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the followin inconsistent dataframe
!image-2024-02-02-13-05-26-203.png|width=352,height=120!

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Not setting it to "" (defaults to "\") seems to not cause this 
bug.


> CSV reader reads data inconsistently depending on column position
> -
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
>
> Reading the following CSV
> {code:java}
> "a";"b";"c";"d"
> 10;100,00;"Some;String";"ok"
> 20;200,00;"";"still ok"
> 30;300,00;"also ok";""
> 40;400,00;"";"" {code}
> with these options
> {code:java}
> spark.read
> .option("header","true")
> .option("sep",";")
> .option("encoding","ISO-8859-1")
> .option("lineSep","\r\n")
> .option("nullValue","")
> .option("quote",'"')
> .option("escape","") {code}
> results in the following inconsistent dataframe
>  
> As one can see, the quoted empty fields of the last column are not correctly 
> read as null, whereas it works for column c.
> If I recall correctly, this only happens when the "escape" option is set to 
> an empty string. Leaving it unset (it defaults to "\") does not seem to cause 
> this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46959) CSV reader reads data inconsistently depending on column position

2024-02-02 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-46959:
--
Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the following inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|

 

 

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to cause this 
bug.

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the followin inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|
| | | | |

 

 

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Not setting it to "" (defaults to "\") seems to not cause this 
bug.


> CSV reader reads data inconsistently depending on column position
> -
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
>
> Reading the following CSV
> {code:java}
> "a";"b";"c";"d"
> 10;100,00;"Some;String";"ok"
> 20;200,00;"";"still ok"
> 30;300,00;"also ok";""
> 40;400,00;"";"" {code}
> with these options
> {code:java}
> spark.read
> .option("header","true")
> .option("sep",";")
> .option("encoding","ISO-8859-1")
> .option("lineSep","\r\n")
> .option("nullValue","")
> .option("quote",'"')
> .option("escape","") {code}
> results in the following inconsistent dataframe
>  
> ||a||b||c||d||
> |10|100,00|Some;String|ok|
> |20|200,00||still ok|
> |30|300,00|also ok|"|
> |40|400,00||"|
>  
>  
> As one can see, the quoted empty fields of the last column are not correctly 
> read as null, whereas it works for column c.
> If I recall correctly, this only happens when the "escape" option is set to 
> an empty string. Leaving it unset (it defaults to "\") does not seem to cause 
> this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46959) CSV reader reads data inconsistently depending on column position

2024-02-02 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-46959:
--
Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the following inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|
| | | | |

 

 

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to cause this 
bug.

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the followin inconsistent dataframe

 

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Not setting it to "" (defaults to "\") seems to not cause this 
bug.


> CSV reader reads data inconsistently depending on column position
> -
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
>
> Reading the following CSV
> {code:java}
> "a";"b";"c";"d"
> 10;100,00;"Some;String";"ok"
> 20;200,00;"";"still ok"
> 30;300,00;"also ok";""
> 40;400,00;"";"" {code}
> with these options
> {code:java}
> spark.read
> .option("header","true")
> .option("sep",";")
> .option("encoding","ISO-8859-1")
> .option("lineSep","\r\n")
> .option("nullValue","")
> .option("quote",'"')
> .option("escape","") {code}
> results in the following inconsistent dataframe
>  
> ||a||b||c||d||
> |10|100,00|Some;String|ok|
> |20|200,00||still ok|
> |30|300,00|also ok|"|
> |40|400,00||"|
> | | | | |
>  
>  
> As one can see, the quoted empty fields of the last column are not correctly 
> read as null, whereas it works for column c.
> If I recall correctly, this only happens when the "escape" option is set to 
> an empty string. Leaving it unset (it defaults to "\") does not seem to cause 
> this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46959) CSV reader reads data inconsistently depending on column position

2024-02-02 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-46959:
--
Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the following inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|

As one can see, the quoted empty fields of the last column are not correctly 
read as null but instead contain a single double quote. It works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to cause this 
bug.

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the followin inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|

 

 

As one can see, the quoted empty fields of the last column are not correctly 
read as null, whereas it works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Not setting it to "" (defaults to "\") seems to not cause this 
bug.


> CSV reader reads data inconsistently depending on column position
> -
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
>
> Reading the following CSV
> {code:java}
> "a";"b";"c";"d"
> 10;100,00;"Some;String";"ok"
> 20;200,00;"";"still ok"
> 30;300,00;"also ok";""
> 40;400,00;"";"" {code}
> with these options
> {code:java}
> spark.read
> .option("header","true")
> .option("sep",";")
> .option("encoding","ISO-8859-1")
> .option("lineSep","\r\n")
> .option("nullValue","")
> .option("quote",'"')
> .option("escape","") {code}
> results in the following inconsistent dataframe
>  
> ||a||b||c||d||
> |10|100,00|Some;String|ok|
> |20|200,00||still ok|
> |30|300,00|also ok|"|
> |40|400,00||"|
> As one can see, the quoted empty fields of the last column are not correctly 
> read as null but instead contain a single double quote. It works for column c.
> If I recall correctly, this only happens when the "escape" option is set to 
> an empty string. Leaving it unset (it defaults to "\") does not seem to cause 
> this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46959) CSV reader reads data inconsistently depending on column position

2024-02-02 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-46959:
--
Description: 
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the following inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|

As one can see, the quoted empty fields of the last column are not correctly 
read as null but instead contain a single double quote. It works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Leaving it unset (it defaults to "\") does not seem to cause this 
bug.

I observed this on Databricks Spark runtime 13.2 (I think that is Spark 3.4.1).

  was:
Reading the following CSV
{code:java}
"a";"b";"c";"d"
10;100,00;"Some;String";"ok"
20;200,00;"";"still ok"
30;300,00;"also ok";""
40;400,00;"";"" {code}
with these options
{code:java}
spark.read
.option("header","true")
.option("sep",";")
.option("encoding","ISO-8859-1")
.option("lineSep","\r\n")
.option("nullValue","")
.option("quote",'"')
.option("escape","") {code}
results in the followin inconsistent dataframe

 
||a||b||c||d||
|10|100,00|Some;String|ok|
|20|200,00||still ok|
|30|300,00|also ok|"|
|40|400,00||"|

As one can see, the quoted empty fields of the last column are not correctly 
read as null but instead contain a single double quote. It works for column c.

If I recall correctly, this only happens when the "escape" option is set to an 
empty string. Not setting it to "" (defaults to "\") seems to not cause this 
bug.


> CSV reader reads data inconsistently depending on column position
> -
>
> Key: SPARK-46959
> URL: https://issues.apache.org/jira/browse/SPARK-46959
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
>
> Reading the following CSV
> {code:java}
> "a";"b";"c";"d"
> 10;100,00;"Some;String";"ok"
> 20;200,00;"";"still ok"
> 30;300,00;"also ok";""
> 40;400,00;"";"" {code}
> with these options
> {code:java}
> spark.read
> .option("header","true")
> .option("sep",";")
> .option("encoding","ISO-8859-1")
> .option("lineSep","\r\n")
> .option("nullValue","")
> .option("quote",'"')
> .option("escape","") {code}
> results in the following inconsistent dataframe
>  
> ||a||b||c||d||
> |10|100,00|Some;String|ok|
> |20|200,00||still ok|
> |30|300,00|also ok|"|
> |40|400,00||"|
> As one can see, the quoted empty fields of the last column are not correctly 
> read as null but instead contain a single double quote. It works for column c.
> If I recall correctly, this only happens when the "escape" option is set to 
> an empty string. Leaving it unset (it defaults to "\") does not seem to cause 
> this bug.
> I observed this on Databricks Spark runtime 13.2 (I think that is Spark 3.4.1).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46911) Add deleteIfExists operator to StatefulProcessorHandle

2024-02-02 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-46911.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44903
[https://github.com/apache/spark/pull/44903]

> Add deleteIfExists operator to StatefulProcessorHandle
> --
>
> Key: SPARK-46911
> URL: https://issues.apache.org/jira/browse/SPARK-46911
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Adding the {{deleteIfExists}} method to the {{StatefulProcessorHandle}} in 
> order to remove state variables from the State Store



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46911) Add deleteIfExists operator to StatefulProcessorHandle

2024-02-02 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-46911:


Assignee: Eric Marnadi

> Add deleteIfExists operator to StatefulProcessorHandle
> --
>
> Key: SPARK-46911
> URL: https://issues.apache.org/jira/browse/SPARK-46911
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
>
> Adding the {{deleteIfExists}} method to the {{StatefulProcessorHandle}} in 
> order to remove state variables from the State Store



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Labels: correctness pull-request-available  (was: pull-request-available)

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode, either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Affects Version/s: 3.5.0

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode, either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813733#comment-17813733
 ] 

Nicholas Chammas commented on SPARK-42399:
--

This issue does indeed appear to be resolved on {{master}} when ANSI mode is 
enabled:
{code:java}
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|18446744073709551615|
+--------------------+
>>> spark.conf.set("spark.sql.ansi.enabled", "true")
>>> spark.sql(f"SELECT CONV('{'f' * 64}', 16, 10) AS result").show(truncate=False)
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.ArithmeticException: [ARITHMETIC_OVERFLOW] 
Overflow in function conv(). If necessary set "spark.sql.ansi.enabled" to 
"false" to bypass this error. SQLSTATE: 22003
== SQL (line 1, position 8) ==
SELECT CONV('', 16, 10) AS result
{code}
However, there is still a silent overflow when ANSI mode is disabled. The error 
message suggests this is intended behavior.

cc [~gengliang] and [~gurwls223], who resolved SPARK-42427.
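The value 18446744073709551615 shown above is 2^64 - 1, i.e. the unsigned 64-bit 
maximum that CONV clamps to. The overflow condition itself is easy to check outside 
Spark; a small plain-Scala sketch (illustrative, not Spark's own NumberConverter logic):
{code:java}
// 2^64 - 1: the largest value an unsigned 64-bit accumulator can hold,
// and the value CONV returns for the overflowing inputs above.
val unsignedMax = BigInt("18446744073709551615")

// True when the digits, interpreted in fromBase, exceed the unsigned 64-bit range.
def overflowsUnsigned64(digits: String, fromBase: Int): Boolean =
  BigInt(digits, fromBase) > unsignedMax

overflowsUnsigned64("f" * 64, 16)           // true: the input from the repro above
overflowsUnsigned64("ffffffffffffffff", 16) // false: exactly 2^64 - 1 {code}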

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode, either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider if we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results

2024-02-02 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-42399:
-
Affects Version/s: (was: 3.5.0)

> CONV() silently overflows returning wrong results
> -
>
> Key: SPARK-42399
> URL: https://issues.apache.org/jira/browse/SPARK-42399
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Priority: Critical
>  Labels: correctness, pull-request-available
>
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT 
> CONV(SUBSTRING('0x',
>  3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider whether we can support arbitrary 
> domains since the result is a STRING again. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38167) CSV parsing error when using escape='"'

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813741#comment-17813741
 ] 

Nicholas Chammas commented on SPARK-38167:
--

[~marnixvandenbroek] - Could you link to the bug report you filed with 
Univocity?

cc [~maxgekk] - I believe you have hit some parsing bugs in Univocity recently.
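
In case it is useful for triage, here is a self-contained sketch of the reporter's example (temporary file path and an active SparkSession named {{spark}} are assumed):
{code:python}
>>> import os, tempfile
>>> path = os.path.join(tempfile.mkdtemp(), "escape_repro.csv")
>>> with open(path, "w") as f:
...     f.write('col1,col2\n"",",a"\n')
... 
>>> # show() on the full dataframe renders col2 as ",a" ...
>>> spark.read.csv(path, escape='"', header=True).show()
>>> # ... while selecting col2 afterwards reportedly yields the shifted value.
>>> spark.read.csv(path, escape='"', header=True).select("col2").show()
{code}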

> CSV parsing error when using escape='"' 
> 
>
> Key: SPARK-38167
> URL: https://issues.apache.org/jira/browse/SPARK-38167
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.1
> Environment: Pyspark on a single-node Databricks managed Spark 3.1.2 
> cluster.
>Reporter: Marnix van den Broek
>Priority: Major
>  Labels: correctness, csv, csvparser, data-integrity
>
> hi all,
> When reading CSV files with Spark, I ran into a parsing bug.
> {*}The summary{*}:
> When
>  # reading a comma separated, double-quote quoted CSV file using the csv 
> reader options _escape='"'_ and {_}header=True{_},
>  # with a row containing a quoted empty field
>  # followed by a quoted field starting with a comma and followed by one or 
> more characters
> selecting columns from the dataframe at or after the field described in 3) 
> gives incorrect and inconsistent results
> {*}In detail{*}:
> When I instruct Spark to read this CSV file:
>  
> {code:java}
> col1,col2
> "",",a"
> {code}
>  
> using the CSV reader options escape='"' (unnecessary for the example, 
> necessary for the files I'm processing) and header=True, I expect the 
> following result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).show()
>  
> +----+----+
> |col1|col2|
> +----+----+
> |null|  ,a|
> +----+----+   {code}
>  
>  Spark does yield this result, so far so good. However, when I select col2 
> from the dataframe, Spark yields an incorrect result:
>  
> {code:java}
> spark.read.csv(path, escape='"', header=True).select('col2').show()
>  
> +----+
> |col2|
> +----+
> |  a"|
> +----+{code}
>  
> If you run this example with more columns in the file, and more commas in the 
> field, e.g. ",,,a", the problem compounds, as Spark shifts many values to 
> the right, causing unexpected and incorrect results. The inconsistency 
> between the two methods surprised me, as it implies the parsing is evaluated 
> differently in each case. 
> I expect the bug to be located in the quote-balancing and un-escaping methods 
> of the csv parser, but I can't find where that code is located in the code 
> base. I'd be happy to take a look at it if anyone can point me to where it is. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45786) Inaccurate Decimal multiplication and division results

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813766#comment-17813766
 ] 

Nicholas Chammas commented on SPARK-45786:
--

[~kazuyukitanimura] - I'm just curious: How did you find this bug? Was it 
something you stumbled on by accident or did you search for it using something 
like a fuzzer?
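
Unrelated to the question, but for anyone double-checking the expected values outside Spark, here is a small sketch using Python's decimal module (precision raised so the exact product is preserved):
{code:python}
>>> from decimal import Decimal, ROUND_HALF_UP, getcontext
>>> getcontext().prec = 60  # enough precision to hold the exact product
>>> exact = Decimal("-14120025096157587712113961295153.858047") * Decimal("-0.4652")
>>> exact
Decimal('6568635674732509803675414794505.5747634644')
>>> # The dropped digits start with 4..., so rounding to 6 decimal places keeps ...574763.
>>> exact.quantize(Decimal("0.000001"), rounding=ROUND_HALF_UP)
Decimal('6568635674732509803675414794505.574763')
{code}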

> Inaccurate Decimal multiplication and division results
> --
>
> Key: SPARK-45786
> URL: https://issues.apache.org/jira/browse/SPARK-45786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.3, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Blocker
>  Labels: correctness, pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1
>
>
> Decimal multiplication and division results may be inaccurate due to rounding 
> issues.
> h2. Multiplication:
> {code:scala}
> scala> sql("select  -14120025096157587712113961295153.858047 * 
> -0.4652").show(truncate=false)
> +----------------------------------------------------+
> |(-14120025096157587712113961295153.858047 * -0.4652)|
> +----------------------------------------------------+
> |6568635674732509803675414794505.574764              |
> +----------------------------------------------------+
> {code}
> The correct answer is
> {quote}6568635674732509803675414794505.574763
> {quote}
> Please note that the last digit is 3 instead of 4 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652"))
> val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
> {code}
> Since the fractional part .574763 is followed by 4644, it should not be 
> rounded up.
> h2. Division:
> {code:scala}
> scala> sql("select -0.172787979 / 
> 533704665545018957788294905796.5").show(truncate=false)
> +-------------------------------------------------+
> |(-0.172787979 / 533704665545018957788294905796.5)|
> +-------------------------------------------------+
> |-3.237521E-31                                    |
> +-------------------------------------------------+
> {code}
> The correct answer is
> {quote}-3.237520E-31
> {quote}
> Please note that the last digit is 0 instead of 1 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"),
>  100, java.math.RoundingMode.DOWN)
> val res22: java.math.BigDecimal = 
> -3.237520489418037889998826491401059986665344697406144511563561222578738E-31
> {code}
> Since the fractional part .237520 is followed by 4894..., it should not be 
> rounded up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46915) Simplify `UnaryMinus` and align error class

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-46915:


Assignee: BingKun Pan

> Simplify `UnaryMinus` and align error class
> ---
>
> Key: SPARK-46915
> URL: https://issues.apache.org/jira/browse/SPARK-46915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46915) Simplify `UnaryMinus` and align error class

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-46915.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44942
[https://github.com/apache/spark/pull/44942]

> Simplify `UnaryMinus` and align error class
> ---
>
> Key: SPARK-46915
> URL: https://issues.apache.org/jira/browse/SPARK-46915
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40549) PYSPARK: Observation computes the wrong results when using `corr` function

2024-02-02 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813780#comment-17813780
 ] 

Nicholas Chammas commented on SPARK-40549:
--

I think this is just a consequence of floating point arithmetic being imprecise.
{code:python}
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
... 
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0} {code}
Unfortunately, {{corr}} seems to convert to float internally, so even if you 
give it decimals you will get a similar result:
{code:python}
>>> from decimal import Decimal
>>> import pyspark.sql.functions as F
>>> 
>>> df = spark.createDataFrame(
...     [(Decimal(i), Decimal(i * 10)) for i in range(10)],
...     schema="id decimal, id2 decimal",
... )
>>> 
>>> for i in range(10):
...     o = Observation(f"test_{i}")
...     df_o = df.observe(o, F.corr("id", "id2"))
...     df_o.count()
...     print(o.get)
... 
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 0.}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0002}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0}
{'corr(id, id2)': 1.0} {code}

I don't think there is anything that can be done here.
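
If a downstream check needs a stable boolean, one possible workaround (just a sketch, reusing the toy data from the report; an active SparkSession named {{spark}} is assumed) is to compare the observed value against a tolerance instead of using eqNullSafe:
{code:python}
>>> from pyspark.sql import Observation
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame(
...     [(float(i), float(i * 10)) for i in range(10)],
...     schema="id double, id2 double",
... )
>>> o = Observation("corr_check")
>>> df_o = df.observe(o, F.corr("id", "id2").alias("corr"))
>>> df_o.count()
10
>>> abs(o.get["corr"] - 1.0) < 1e-9  # tolerant comparison instead of <=> 1.0
True
{code}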

> PYSPARK: Observation computes the wrong results when using `corr` function 
> ---
>
> Key: SPARK-40549
> URL: https://issues.apache.org/jira/browse/SPARK-40549
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
> Environment: {code:java}
> // lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:    Ubuntu 22.04.1 LTS
> Release:        22.04
> Codename:       jammy {code}
> {code:java}
>  // python -V
> python 3.10.4
> {code}
> {code:java}
>  // lshw -class cpu
> *-cpu                             
> description: CPU        product: AMD Ryzen 9 3900X 12-Core Processor        
> vendor: Advanced Micro Devices [AMD]        physical id: f        bus info: 
> cpu@0        version: 23.113.0        serial: Unknown        slot: AM4        
> size: 2194MHz        capacity: 4672MHz        width: 64 bits        clock: 
> 100MHz        capabilities: lm fpu fpu_exception wp vme de pse tsc msr pae 
> mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht 
> syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl 
> nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma 
> cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy 
> svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit 
> wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 
> cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm 
> rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves 
> cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr 
> rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean 
> flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif 
> v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es cpufreq      
>   configuration: cores=12 enabledcores=12 microcode=141561875 threads=24
> {code}
>Reporter: Herminio Vazquez
>Priority: Major
>  Labels: correctness
>
> Minimalistic description of the odd computation results.
> When creating a new `Observation` object and computing a simple correlation 
> function between 2 columns, the results appear to be non-deterministic.
> {code:java}
> # Init
> from pyspark.sql import SparkSession, Observation
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(float(i), float(i*10),) for i in range(10)], 
> schema="id double, id2 double")
> for i in range(10):
>     o = Observation(f"test_{i}")
>     df_o = df.observe(o, F.corr("id", "id2").eqNullSafe(1.0))
>     df_o.count()
>     print(o.get)
> # Results
> {'(corr(id, id2) <=> 1.0)': False}
> {'(corr(id, id2) <=> 1.0)': False}
> {'(corr(id, id2) <=> 1.0)': False}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': True}
> {'(corr(id, id2) <=> 1.0)': False}{code}
>  




[jira] [Created] (SPARK-46960) Testing Multiple Input Streams for TransformWithState operator

2024-02-02 Thread Eric Marnadi (Jira)
Eric Marnadi created SPARK-46960:


 Summary: Testing Multiple Input Streams for TransformWithState 
operator
 Key: SPARK-46960
 URL: https://issues.apache.org/jira/browse/SPARK-46960
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Eric Marnadi


Adding unit tests to ensure multiple input streams are supported for the 
TransformWithState operator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46961) Adding processorHandle as a Context Variable

2024-02-02 Thread Eric Marnadi (Jira)
Eric Marnadi created SPARK-46961:


 Summary: Adding processorHandle as a Context Variable
 Key: SPARK-46961
 URL: https://issues.apache.org/jira/browse/SPARK-46961
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Eric Marnadi


Adding unit tests to ensure multiple input streams are supported for the 
TransformWithState operator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46866) Streaming python data source API

2024-02-02 Thread Chaoqin Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chaoqin Li updated SPARK-46866:
---
Issue Type: Epic  (was: Improvement)

> Streaming python data source API
> 
>
> Key: SPARK-46866
> URL: https://issues.apache.org/jira/browse/SPARK-46866
> Project: Spark
>  Issue Type: Epic
>  Components: SS
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Priority: Major
>
> This is a follow up of https://issues.apache.org/jira/browse/SPARK-44076. The 
> idea is to enable Python developers to develop streaming data sources in 
> python. The goal is to make a Python-based API that is simple and easy to 
> use, thus making Spark more accessible to the wider Python developer 
> community.
>  
> Design doc: 
> https://docs.google.com/document/d/1cJ-w1hGPOBFp-5DLmf68sTLsAOwb55oW6SAuuAUFEM4/edit



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46962) Implement python worker to run python streaming data source

2024-02-02 Thread Chaoqin Li (Jira)
Chaoqin Li created SPARK-46962:
--

 Summary: Implement python worker to run python streaming data 
source
 Key: SPARK-46962
 URL: https://issues.apache.org/jira/browse/SPARK-46962
 Project: Spark
  Issue Type: Improvement
  Components: SS
Affects Versions: 4.0.0
Reporter: Chaoqin Li


Implement a Python worker to run the Python streaming data source and 
communicate with the JVM through a socket. Create a PythonMicrobatchStream to 
invoke RPC function calls.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46963) Verify AQE is not enabled for Structured Streaming

2024-02-02 Thread Bo Gao (Jira)
Bo Gao created SPARK-46963:
--

 Summary: Verify AQE is not enabled for Structured Streaming
 Key: SPARK-46963
 URL: https://issues.apache.org/jira/browse/SPARK-46963
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Bo Gao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46963) Verify AQE is not enabled for Structured Streaming

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46963:
---
Labels: pull-request-available  (was: )

> Verify AQE is not enabled for Structured Streaming
> --
>
> Key: SPARK-46963
> URL: https://issues.apache.org/jira/browse/SPARK-46963
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bo Gao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46964) Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument

2024-02-02 Thread Menelaos Karavelas (Jira)
Menelaos Karavelas created SPARK-46964:
--

 Summary: Change the signature of the hllInvalidLgK query execution 
error to take an integer as 4th argument
 Key: SPARK-46964
 URL: https://issues.apache.org/jira/browse/SPARK-46964
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Menelaos Karavelas


The current signature of the {{hllInvalidLgK}} query execution error takes four 
arguments:
 # The SQL function (a string).
 # The minimum possible {{lgk}} value (an integer).
 # The maximum possible {{lgk}} value (an integer).
 # The actual invalid {{lgk}} value (a string).

There is no meaningful reason for the 4th argument to be a string. This issue 
is about changing the 4th argument to an integer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46964) Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46964:
---
Labels: pull-request-available  (was: )

> Change the signature of the hllInvalidLgK query execution error to take an 
> integer as 4th argument
> --
>
> Key: SPARK-46964
> URL: https://issues.apache.org/jira/browse/SPARK-46964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Menelaos Karavelas
>Priority: Trivial
>  Labels: pull-request-available
>
> The current signature of the {{hllInvalidLgK}} query execution error takes 
> four arguments:
>  # The SQL function (a string).
>  # The minimum possible {{lgk}} value (an integer).
>  # The maximum possible {{lgk}} value (an integer).
>  # The actual invalid {{lgk}} value (a string).
> There is no meaningful reason for the 4th argument to be a string. This issue 
> is about changing the 4th argument to an integer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46964) Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument

2024-02-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-46964.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44995
[https://github.com/apache/spark/pull/44995]

> Change the signature of the hllInvalidLgK query execution error to take an 
> integer as 4th argument
> --
>
> Key: SPARK-46964
> URL: https://issues.apache.org/jira/browse/SPARK-46964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Menelaos Karavelas
>Assignee: Menelaos Karavelas
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The current signature of the {{hllInvalidLgK}} query execution error takes 
> four arguments:
>  # The SQL function (a string).
>  # The minimum possible {{lgk}} value (an integer).
>  # The maximum possible {{lgk}} value (an integer).
>  # The actual invalid {{lgk}} value (a string).
> There is no meaningful reason for the 4th argument to be a string. This issue 
> is about changing the 4th argument to an integer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46964) Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument

2024-02-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-46964:
--

Assignee: Menelaos Karavelas

> Change the signature of the hllInvalidLgK query execution error to take an 
> integer as 4th argument
> --
>
> Key: SPARK-46964
> URL: https://issues.apache.org/jira/browse/SPARK-46964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Menelaos Karavelas
>Assignee: Menelaos Karavelas
>Priority: Trivial
>  Labels: pull-request-available
>
> The current signature of the {{hllInvalidLgK}} query execution error takes 
> four arguments:
>  # The SQL function (a string).
>  # The minimum possible {{lgk}} value (an integer).
>  # The maximum possible {{lgk}} value (an integer).
>  # The actual invalid {{lgk}} value (a string).
> There is no meaningful reason for the 4th argument to be a string. This issue 
> is about changing the 4th argument to an integer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46965) Check logType in Utils.getLog

2024-02-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-46965:
-

 Summary: Check logType in Utils.getLog
 Key: SPARK-46965
 URL: https://issues.apache.org/jira/browse/SPARK-46965
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46965) Check logType in Utils.getLog

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46965:
---
Labels: pull-request-available  (was: )

> Check logType in Utils.getLog
> -
>
> Key: SPARK-46965
> URL: https://issues.apache.org/jira/browse/SPARK-46965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46965) Check logType in Utils.getLog

2024-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-46965:
-

Assignee: Dongjoon Hyun

> Check logType in Utils.getLog
> -
>
> Key: SPARK-46965
> URL: https://issues.apache.org/jira/browse/SPARK-46965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46638) Create API to acquire execution memory for 'eval' and 'terminate' methods

2024-02-02 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel resolved SPARK-46638.

Resolution: Won't Fix

> Create API to acquire execution memory for 'eval' and 'terminate' methods
> -
>
> Key: SPARK-46638
> URL: https://issues.apache.org/jira/browse/SPARK-46638
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46966) Create API for 'analyze' method to indicate subset of input table columns to select

2024-02-02 Thread Daniel (Jira)
Daniel created SPARK-46966:
--

 Summary: Create API for 'analyze' method to indicate subset of 
input table columns to select
 Key: SPARK-46966
 URL: https://issues.apache.org/jira/browse/SPARK-46966
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46966) Create API for 'analyze' method to indicate subset of input table columns to select

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46966:
---
Labels: pull-request-available  (was: )

> Create API for 'analyze' method to indicate subset of input table columns to 
> select
> ---
>
> Key: SPARK-46966
> URL: https://issues.apache.org/jira/browse/SPARK-46966
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Daniel
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46965) Check logType in Utils.getLog

2024-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-46965.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45006
[https://github.com/apache/spark/pull/45006]

> Check logType in Utils.getLog
> -
>
> Key: SPARK-46965
> URL: https://issues.apache.org/jira/browse/SPARK-46965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46890) CSV fails on a column with default and without enforcing schema

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-46890:


Assignee: Daniel

> CSV fails on a column with default and without enforcing schema
> ---
>
> Key: SPARK-46890
> URL: https://issues.apache.org/jira/browse/SPARK-46890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-01-29-13-22-05-326.png
>
>
> When we create a table using CSV on an existing file with a header and:
>  - a column has a default value, and
>  - enforceSchema is false (so the CSV header is taken into account),
> then querying a column with a default fails.
> The example below shows the issue:
> {code:sql}
> CREATE TABLE IF NOT EXISTS products (
>   product_id INT,
>   name STRING,
>   price FLOAT default 0.0,
>   quantity INT default 0
> )
> USING CSV
> OPTIONS (
>   header 'true',
>   inferSchema 'false',
>   enforceSchema 'false',
>   path '/Users/maximgekk/tmp/products.csv'
> );
> {code}
> The CSV file products.csv:
> {code:java}
> product_id,name,price,quantity
> 1,Apple,0.50,100
> 2,Banana,0.25,200
> 3,Orange,0.75,50
> {code}
> The query fails:
> {code:sql}
> spark-sql (default)> SELECT price FROM products;
> 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6)
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
>  Header length: 4, schema size: 1
> CSV file: file:///Users/maximgekk/tmp/products.csv
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46890) CSV fails on a column with default and without enforcing schema

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-46890.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44939
[https://github.com/apache/spark/pull/44939]

> CSV fails on a column with default and without enforcing schema
> ---
>
> Key: SPARK-46890
> URL: https://issues.apache.org/jira/browse/SPARK-46890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2024-01-29-13-22-05-326.png
>
>
> When we create a table using CSV on an existing file with a header and:
>  - a column has a default value, and
>  - enforceSchema is false (so the CSV header is taken into account),
> then querying a column with a default fails.
> The example below shows the issue:
> {code:sql}
> CREATE TABLE IF NOT EXISTS products (
>   product_id INT,
>   name STRING,
>   price FLOAT default 0.0,
>   quantity INT default 0
> )
> USING CSV
> OPTIONS (
>   header 'true',
>   inferSchema 'false',
>   enforceSchema 'false',
>   path '/Users/maximgekk/tmp/products.csv'
> );
> {code}
> The CSV file products.csv:
> {code:java}
> product_id,name,price,quantity
> 1,Apple,0.50,100
> 2,Banana,0.25,200
> 3,Orange,0.75,50
> {code}
> The query fails:
> {code:sql}
> spark-sql (default)> SELECT price FROM products;
> 24/01/28 11:43:09 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 6)
> java.lang.IllegalArgumentException: Number of column in CSV header is not 
> equal to number of fields in the schema:
>  Header length: 4, schema size: 1
> CSV file: file:///Users/maximgekk/tmp/products.csv
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44111) Prepare Apache Spark 4.0.0

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44111:
---
Labels: pull-request-available  (was: )

> Prepare Apache Spark 4.0.0
> --
>
> Key: SPARK-44111
> URL: https://issues.apache.org/jira/browse/SPARK-44111
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>  Labels: pull-request-available
>
> For now, this issue aims to collect ideas for planning Apache Spark 4.0.0.
> We will add more items which will be excluded from Apache Spark 3.5.0 
> (Feature Freeze: July 16th, 2023).
> {code}
> Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
> Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
> Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
> Spark 4: 2024.06 (4.0.0, NEW)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46950) Align `not available codec` error-class

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-46950:


Assignee: BingKun Pan

> Align `not available codec` error-class
> ---
>
> Key: SPARK-46950
> URL: https://issues.apache.org/jira/browse/SPARK-46950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46950) Align `not available codec` error-class

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-46950.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44992
[https://github.com/apache/spark/pull/44992]

> Align `not available codec` error-class
> ---
>
> Key: SPARK-46950
> URL: https://issues.apache.org/jira/browse/SPARK-46950
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46967) Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI

2024-02-02 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46967:
--
Component/s: Web UI

> Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI
> -
>
> Key: SPARK-46967
> URL: https://issues.apache.org/jira/browse/SPARK-46967
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46967) Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI

2024-02-02 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-46967:
-

 Summary: Hide `Thread Dump` and `Heap Histogram` of `Dead` 
executors in `Executors` UI
 Key: SPARK-46967
 URL: https://issues.apache.org/jira/browse/SPARK-46967
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46968) Replace UnsupportedOperationException by SparkUnsupportedOperationException in sql

2024-02-02 Thread Max Gekk (Jira)
Max Gekk created SPARK-46968:


 Summary: Replace UnsupportedOperationException by 
SparkUnsupportedOperationException in sql
 Key: SPARK-46968
 URL: https://issues.apache.org/jira/browse/SPARK-46968
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk
 Fix For: 4.0.0


Replace all UnsupportedOperationException by SparkUnsupportedOperationException 
in sql/core code base, and introduce new legacy error classes with the 
_LEGACY_ERROR_TEMP_ prefix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46967) Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI

2024-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46967:
---
Labels: pull-request-available  (was: )

> Hide `Thread Dump` and `Heap Histogram` of `Dead` executors in `Executors` UI
> -
>
> Key: SPARK-46967
> URL: https://issues.apache.org/jira/browse/SPARK-46967
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46968) Replace UnsupportedOperationException by SparkUnsupportedOperationException in sql

2024-02-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-46968:
-
Description: Replace all UnsupportedOperationException by 
SparkUnsupportedOperationException in the *sql* code base, and introduce new 
legacy error classes with the _LEGACY_ERROR_TEMP_ prefix.  (was: Replace all 
UnsupportedOperationException by SparkUnsupportedOperationException in sql/core 
code base, and introduce new legacy error classes with the _LEGACY_ERROR_TEMP_ 
prefix.)

> Replace UnsupportedOperationException by SparkUnsupportedOperationException 
> in sql
> --
>
> Key: SPARK-46968
> URL: https://issues.apache.org/jira/browse/SPARK-46968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Replace all UnsupportedOperationException by 
> SparkUnsupportedOperationException in the *sql* code base, and introduce new 
> legacy error classes with the _LEGACY_ERROR_TEMP_ prefix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org