[jira] [Updated] (SPARK-40298) shuffle data recovery on the reused PVCs has no effect

2022-09-01 Thread todd (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

todd updated SPARK-40298:
-
Priority: Blocker  (was: Major)

> shuffle data recovery on the reused PVCs has no effect
> ---
>
> Key: SPARK-40298
> URL: https://issues.apache.org/jira/browse/SPARK-40298
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.2
>Reporter: todd
>Priority: Blocker
> Attachments: 1662002808396.jpg, 1662002822097.jpg
>
>
> I am using Spark 3.2.2 to test the [Support shuffle data recovery on the reused 
> PVCs (SPARK-35593)] feature. I found that when a shuffle read fails, the data is 
> still recomputed from the source rather than recovered from the reused PVC.
> It can be confirmed that the PVC has been reused by other pods, and that the 
> index and data files have been written to it.
> *This is my spark configuration information:*
> --conf spark.driver.memory=5G 
> --conf spark.executor.memory=15G 
> --conf spark.executor.cores=1
> --conf spark.executor.instances=50
> --conf spark.sql.shuffle.partitions=50
> --conf spark.dynamicAllocation.enabled=false
> --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true
> --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data
> --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false
> --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data
> --conf spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO
> --conf spark.kubernetes.executor.missingPodDetectDelta=10s
> --conf spark.kubernetes.executor.apiPollingInterval=10s
> --conf spark.shuffle.io.retryWait=60s
> --conf spark.shuffle.io.maxRetries=5
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40301) Add parameter validation in pyspark.rdd

2022-09-01 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40301:
-

 Summary: Add parameter validation in pyspark.rdd
 Key: SPARK-40301
 URL: https://issues.apache.org/jira/browse/SPARK-40301
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
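
As a purely hypothetical sketch of the kind of eager argument check such a change
usually adds (the parameter name and messages below are assumptions, not the actual
patch):

{code:python}
# Hypothetical helper showing the style of validation being proposed for pyspark.rdd:
# fail fast with a clear TypeError/ValueError instead of a confusing failure inside a job.
def _check_num_partitions(num_partitions):
    """Reject an invalid numPartitions eagerly, before any job is launched."""
    if not isinstance(num_partitions, int):
        raise TypeError(
            "numPartitions should be an int, got %s" % type(num_partitions).__name__)
    if num_partitions <= 0:
        raise ValueError("numPartitions should be positive, got %d" % num_partitions)
    return num_partitions

# e.g. _check_num_partitions("4") raises TypeError immediately.
{code}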






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40301) Add parameter validation in pyspark.rdd

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598784#comment-17598784
 ] 

Apache Spark commented on SPARK-40301:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37752

> Add parameter validation in pyspark.rdd
> ---
>
> Key: SPARK-40301
> URL: https://issues.apache.org/jira/browse/SPARK-40301
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40301) Add parameter validation in pyspark.rdd

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40301:


Assignee: (was: Apache Spark)

> Add parameter validation in pyspark.rdd
> ---
>
> Key: SPARK-40301
> URL: https://issues.apache.org/jira/browse/SPARK-40301
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40301) Add parameter validation in pyspark.rdd

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40301:


Assignee: Apache Spark

> Add parameter validation in pyspark.rdd
> ---
>
> Key: SPARK-40301
> URL: https://issues.apache.org/jira/browse/SPARK-40301
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39664) RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)

2022-09-01 Thread Ruifeng Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598813#comment-17598813
 ] 

Ruifeng Zheng commented on SPARK-39664:
---

[~igaloly] _RowMatrix_ is in .mllib, while _Correlation_ is in .ml. The .mllib API 
is implemented on top of RDDs, while .ml is built on DataFrames.

.mllib is in maintenance mode, and we only update it if there is a 
correctness issue.
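
As a rough illustration of the two code paths being compared (a minimal sketch on
toy data; the column name '_1' only mirrors the reporter's setup):

{code:python}
# Contrast the RDD-based .mllib call with the DataFrame-based .ml call.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation                  # .ml: DataFrame API
from pyspark.mllib.linalg.distributed import RowMatrix   # .mllib: RDD API

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),), (Vectors.dense([5.0, 7.0]),)],
    ["_1"])

# .mllib path: pull the column out as an RDD of plain lists, then compute a
# local covariance matrix on the driver.
cov = RowMatrix(df.rdd.map(lambda r: r["_1"].toArray().tolist())).computeCovariance()

# .ml path: stay on the DataFrame and get the Pearson correlation matrix back.
corr = Correlation.corr(df, "_1").head()[0]
{code}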

> RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
> 
>
> Key: SPARK-39664
> URL: https://issues.apache.org/jira/browse/SPARK-39664
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.2.1
>Reporter: igal l
>Priority: Major
>
> I have a PySpark DataFrame with one column. The column type is Vector and the 
> values are DenseVectors of size 768. The DataFrame has 1 million rows.
> I want to calculate the covariance matrix of this set of vectors.
> When I calculate it with `RowMatrix(df.rdd.map(list)).computeCovariance()`, it 
> takes 1.57 minutes.
> When I calculate the correlation matrix with `Correlation.corr(df, '_1')`, it 
> takes 33 seconds.
> The formulas for covariance and correlation are pretty much the same, so I don't 
> understand the gap between them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39664) RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)

2022-09-01 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-39664.
---
Resolution: Not A Problem

> RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
> 
>
> Key: SPARK-39664
> URL: https://issues.apache.org/jira/browse/SPARK-39664
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.2.1
>Reporter: igal l
>Priority: Major
>
> I have a PySpark DataFrame with one column. The column type is Vector and the 
> values are DenseVectors of size 768. The DataFrame has 1 million rows.
> I want to calculate the covariance matrix of this set of vectors.
> When I calculate it with `RowMatrix(df.rdd.map(list)).computeCovariance()`, it 
> takes 1.57 minutes.
> When I calculate the correlation matrix with `Correlation.corr(df, '_1')`, it 
> takes 33 seconds.
> The formulas for covariance and correlation are pretty much the same, so I don't 
> understand the gap between them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-40302:
-

 Summary: Add YuniKornSuite
 Key: SPARK-40302
 URL: https://issues.apache.org/jira/browse/SPARK-40302
 Project: Spark
  Issue Type: Test
  Components: Kubernetes, Tests
Affects Versions: 3.4.0, 3.3.1
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598826#comment-17598826
 ] 

Apache Spark commented on SPARK-40302:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37753

> Add YuniKornSuite
> -
>
> Key: SPARK-40302
> URL: https://issues.apache.org/jira/browse/SPARK-40302
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40302:


Assignee: Apache Spark

> Add YuniKornSuite
> -
>
> Key: SPARK-40302
> URL: https://issues.apache.org/jira/browse/SPARK-40302
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40302:


Assignee: (was: Apache Spark)

> Add YuniKornSuite
> -
>
> Key: SPARK-40302
> URL: https://issues.apache.org/jira/browse/SPARK-40302
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-40303:
---

 Summary: The performance will be worse after codegen
 Key: SPARK-40303
 URL: https://issues.apache.org/jira/browse/SPARK-40303
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yuming Wang



{code:scala}
import org.apache.spark.benchmark.Benchmark

val dir = "/tmp/spark/benchmark"
val N = 200
val columns = Range(0, 100).map(i => s"id % $i AS id$i")

spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)

// Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
Seq(60).foreach { cnt =>
  val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => s"count(distinct $c)")

  val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 1)
  benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
    withSQLConf(
      "spark.sql.codegen.wholeStage" -> "true",
      "spark.sql.codegen.factoryMode" -> "FALLBACK") {
      spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
    }
  }

  benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
    withSQLConf(
      "spark.sql.codegen.wholeStage" -> "false",
      "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
      spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
    }
  }
  benchmark.run()
}
{code}


{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Benchmark count distinct:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------
60 count distinct with codegen                628146         628146           0          0.0      314072.8       1.0X
60 count distinct without codegen             147635         147635           0          0.0       73817.5       4.3X
{noformat}





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598831#comment-17598831
 ] 

Yuming Wang commented on SPARK-40303:
-

cc [~cloud_fan] [~joshrosen] [~rednaxelafx]

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach { cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
>     withSQLConf(
>       "spark.sql.codegen.wholeStage" -> "true",
>       "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>       spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
>     }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
>     withSQLConf(
>       "spark.sql.codegen.wholeStage" -> "false",
>       "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>       spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save()
>     }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ----------------------------------------------------------------------------------------------------------------------
> 60 count distinct with codegen                628146         628146           0          0.0      314072.8       1.0X
> 60 count distinct without codegen             147635         147635           0          0.0       73817.5       4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

2022-09-01 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598835#comment-17598835
 ] 

Steve Loughran commented on SPARK-40287:


Does this happen when
# you switch to an ASF Spark build with the s3a connector
# and use an s3a committer that is safe to use with Spark?

This is clearly EMR (s3:// URLs), so they have to be the people to talk to if 
you can't replicate it in the Apache code.
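
For reference, pointing an ASF build at the S3A committers typically looks 
something like the following (a sketch based on the Spark cloud-integration docs, 
assuming hadoop-aws and the spark-hadoop-cloud module are on the classpath; none 
of it is taken from this ticket):

{noformat}
--conf spark.hadoop.fs.s3a.committer.name=directory
--conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
--conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
{noformat}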

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a Hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating the 
> table as Parquet, and trying to load data into a single partition of that 
> table, and I'm seeing some odd behavior: when I point LOAD DATA at the S3 
> location of a Parquet dataset, all of the data is moved into the location 
> specified in my CREATE TABLE command, including the partitions I didn't specify 
> in the LOAD DATA command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], 
> ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> In the current state S3 should have a new folder `data` with one folder per 
> partition, each containing its Parquet files: 
>   
>  - s3://bucket/data/p=x/
>     - part-1.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-2.snappy.parquet
>     - part-3.snappy.parquet
>  
> {code:java}
> # create new table
> spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) 
> STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3 specifying single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION 
> (p='x')")
> spark.sql("select * from src").show()
> # output: 
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `load data` command and looking at the table, I'm left with 
> no data loaded in. When checking S3, the data source we saved earlier has been 
> moved under `s3://bucket/new/`; oddly enough, it also brought over the other 
> partitions along with it. The resulting directory structure is listed below. 
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-1.snappy.parquet
>         - p=y/
>             - part-2.snappy.parquet
>             - part-3.snappy.parquet
> Is this the intended behavior when loading data from a partitioned Parquet 
> location? Is the original file supposed to be moved/deleted from the source 
> directory? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file

2022-09-01 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598837#comment-17598837
 ] 

Steve Loughran commented on SPARK-40286:


This is EMR. Can you replicate it in an ASF Spark release through the s3a 
connector and committers?

If you can replicate it, especially in Spark standalone, turn Spark and 
org.apache.hadoop.fs.s3a logging on to debug and see what it says.
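
For Spark 3.2.x, which still ships log4j 1.x, that would roughly mean adding the 
following to conf/log4j.properties (a sketch, not an exact recipe; the log4j2 
format used by Spark 3.3+ differs):

{noformat}
log4j.logger.org.apache.spark=DEBUG
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
{noformat}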

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found, and it 
> populates the table with data. I also tried adding the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598845#comment-17598845
 ] 

Apache Spark commented on SPARK-39906:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37754
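
For reference, the two syntaxes the warning is about look like this (generic sbt 
examples, not the exact commands touched by the PR):

{noformat}
# deprecated sbt 0.13 shell syntax
build/sbt "core/test:package"
# slash syntax
build/sbt "core / Test / package"
{noformat}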

> Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash 
> syntax instead'
> --
>
> Key: SPARK-39906
> URL: https://issues.apache.org/jira/browse/SPARK-39906
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> 1.   
> {code:java}
> ./2_Run  Build modules catalyst, 
> hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell 
> syntax is deprecated; use slash syntax instead: examples / Test / package, 
> avro / Test / package, network-shuffle / Test / package, sketch / Test / 
> package, unsafe / Test / package, launcher / Test / package, network-yarn / 
> Test / package, streaming / Test / package, catalyst / Test / package, 
> hive-thriftserver / Test / package, kvstore / Test / package, core / Test / 
> package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, 
> streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, 
> network-common / Test / package, sql / Test / package, 
> streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, 
> streaming-kinesis-asl / Test / package, docker-integration-tests / Test / 
> package, kubernetes / Test / package, yarn / Test / package, tags / Test / 
> package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, 
> mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / 
> package, tools / Test / package, mllib-local / Test / package, repl / Test / 
> package, sql-kafka-0-10 / Test / package, Test / package
> ./2_Run  Build modules catalyst, 
> hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell 
> syntax is deprecated; use slash syntax instead: examples / Test / package, 
> avro / Test / package, network-shuffle / Test / package, sketch / Test / 
> package, unsafe / Test / package, launcher / Test / package, network-yarn / 
> Test / package, streaming / Test / package, catalyst / Test / package, 
> hive-thriftserver / Test / package, kvstore / Test / package, core / Test / 
> package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, 
> streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, 
> network-common / Test / package, sql / Test / package, 
> streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, 
> streaming-kinesis-asl / Test / package, docker-integration-tests / Test / 
> package, kubernetes / Test / package, yarn / Test / package, tags / Test / 
> package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, 
> mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / 
> package, tools / Test / package, mllib-local / Test / package, repl / Test / 
> package, sql-kafka-0-10 / Test / package, Test / package
> ./Run  Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run 
> tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is 
> deprecated; use slash syntax instead: network-yarn / Test / package, 
> network-shuffle / Test / package, sketch / Test / package, yarn / Test / 
> package, sql / Test / package, core / Test / package, hive-thriftserver / 
> Test / package, ganglia-lgpl / Test / package, streaming / Test / package, 
> streaming-kinesis-asl / Test / package, docker-integration-tests / Test / 
> package, kubernetes / Test / package, launcher / Test / package, 
> streaming-kinesis-asl-assembly / Test / package, tags / Test / package, 
> assembly / Test / package, mllib-local / Test / package, 
> token-provider-kafka-0-10 / Test / package, repl / Test / package, graphx / 
> Test / package, sql-kafka-0-10 / Test / package, mesos / Test / package, 
> streaming-kafka-0-10 / Test / package, streaming-kafka-0-10-assembly / Test / 
> package, examples / Test / package, tools / Test / package, avro / Test / 
> package, hadoop-cloud / Test / package, mllib / Test / package, kvstore / 
> Test / package, hive / Test / package, catalyst / Test / package, 
> network-common / Test / package, unsafe / Test / package, Test / package
> ./Run  Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run 
> tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is 
> deprecated; use slash syntax instead: network-yarn / Test / package, 
> network-shuffle / Test / package, sketch / Test / package, yarn / Test / 
> package, sql / Test 

[jira] [Commented] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598847#comment-17598847
 ] 

Apache Spark commented on SPARK-39906:
--

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37754

> Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash 
> syntax instead'
> --
>
> Key: SPARK-39906
> URL: https://issues.apache.org/jira/browse/SPARK-39906
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.4.0
>
>
> 1.   
> {code:java}
> ./2_Run  Build modules catalyst, 
> hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell 
> syntax is deprecated; use slash syntax instead: examples / Test / package, 
> avro / Test / package, network-shuffle / Test / package, sketch / Test / 
> package, unsafe / Test / package, launcher / Test / package, network-yarn / 
> Test / package, streaming / Test / package, catalyst / Test / package, 
> hive-thriftserver / Test / package, kvstore / Test / package, core / Test / 
> package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, 
> streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, 
> network-common / Test / package, sql / Test / package, 
> streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, 
> streaming-kinesis-asl / Test / package, docker-integration-tests / Test / 
> package, kubernetes / Test / package, yarn / Test / package, tags / Test / 
> package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, 
> mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / 
> package, tools / Test / package, mllib-local / Test / package, repl / Test / 
> package, sql-kafka-0-10 / Test / package, Test / package
> ./2_Run  Build modules catalyst, 
> hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell 
> syntax is deprecated; use slash syntax instead: examples / Test / package, 
> avro / Test / package, network-shuffle / Test / package, sketch / Test / 
> package, unsafe / Test / package, launcher / Test / package, network-yarn / 
> Test / package, streaming / Test / package, catalyst / Test / package, 
> hive-thriftserver / Test / package, kvstore / Test / package, core / Test / 
> package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, 
> streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, 
> network-common / Test / package, sql / Test / package, 
> streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, 
> streaming-kinesis-asl / Test / package, docker-integration-tests / Test / 
> package, kubernetes / Test / package, yarn / Test / package, tags / Test / 
> package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, 
> mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / 
> package, tools / Test / package, mllib-local / Test / package, repl / Test / 
> package, sql-kafka-0-10 / Test / package, Test / package
> ./Run  Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run 
> tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is 
> deprecated; use slash syntax instead: network-yarn / Test / package, 
> network-shuffle / Test / package, sketch / Test / package, yarn / Test / 
> package, sql / Test / package, core / Test / package, hive-thriftserver / 
> Test / package, ganglia-lgpl / Test / package, streaming / Test / package, 
> streaming-kinesis-asl / Test / package, docker-integration-tests / Test / 
> package, kubernetes / Test / package, launcher / Test / package, 
> streaming-kinesis-asl-assembly / Test / package, tags / Test / package, 
> assembly / Test / package, mllib-local / Test / package, 
> token-provider-kafka-0-10 / Test / package, repl / Test / package, graphx / 
> Test / package, sql-kafka-0-10 / Test / package, mesos / Test / package, 
> streaming-kafka-0-10 / Test / package, streaming-kafka-0-10-assembly / Test / 
> package, examples / Test / package, tools / Test / package, avro / Test / 
> package, hadoop-cloud / Test / package, mllib / Test / package, kvstore / 
> Test / package, hive / Test / package, catalyst / Test / package, 
> network-common / Test / package, unsafe / Test / package, Test / package
> ./Run  Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run 
> tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is 
> deprecated; use slash syntax instead: network-yarn / Test / package, 
> network-shuffle / Test / package, sketch / Test / package, yarn / Test / 
> package, sql / Test 

[jira] [Updated] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40304:
--
Priority: Minor  (was: Major)

> Add decomTestTag to K8s Integration Test
> 
>
> Key: SPARK-40304
> URL: https://issues.apache.org/jira/browse/SPARK-40304
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-40304:
-

 Summary: Add decomTestTag to K8s Integration Test
 Key: SPARK-40304
 URL: https://issues.apache.org/jira/browse/SPARK-40304
 Project: Spark
  Issue Type: Test
  Components: Kubernetes, Tests
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598866#comment-17598866
 ] 

Apache Spark commented on SPARK-40304:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37755

> Add decomTestTag to K8s Integration Test
> 
>
> Key: SPARK-40304
> URL: https://issues.apache.org/jira/browse/SPARK-40304
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40304:


Assignee: (was: Apache Spark)

> Add decomTestTag to K8s Integration Test
> 
>
> Key: SPARK-40304
> URL: https://issues.apache.org/jira/browse/SPARK-40304
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40304:


Assignee: Apache Spark

> Add decomTestTag to K8s Integration Test
> 
>
> Key: SPARK-40304
> URL: https://issues.apache.org/jira/browse/SPARK-40304
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598867#comment-17598867
 ] 

Apache Spark commented on SPARK-40304:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37755

> Add decomTestTag to K8s Integration Test
> 
>
> Key: SPARK-40304
> URL: https://issues.apache.org/jira/browse/SPARK-40304
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40305) Implement Groupby.sem

2022-09-01 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40305:
-

 Summary: Implement Groupby.sem
 Key: SPARK-40305
 URL: https://issues.apache.org/jira/browse/SPARK-40305
 Project: Spark
  Issue Type: Improvement
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng
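
As context, a minimal plain-pandas sketch of the GroupBy.sem semantics presumably 
being targeted (sem = std(ddof=1) / sqrt(count)); the eventual pandas-on-Spark API 
shape is an assumption, not something stated here:

{code:python}
# Standard error of the mean per group, shown with plain pandas, which already
# implements GroupBy.sem.
import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1.0, 2.0, 3.0, 5.0]})
grouped = pdf.groupby("a")["b"]

print(grouped.sem())                                 # built-in pandas result
print(grouped.std(ddof=1) / grouped.count() ** 0.5)  # same values, computed by hand
{code}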






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40305) Implement Groupby.sem

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598905#comment-17598905
 ] 

Apache Spark commented on SPARK-40305:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37756

> Implement Groupby.sem
> -
>
> Key: SPARK-40305
> URL: https://issues.apache.org/jira/browse/SPARK-40305
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40305) Implement Groupby.sem

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40305:


Assignee: (was: Apache Spark)

> Implement Groupby.sem
> -
>
> Key: SPARK-40305
> URL: https://issues.apache.org/jira/browse/SPARK-40305
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40305) Implement Groupby.sem

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598906#comment-17598906
 ] 

Apache Spark commented on SPARK-40305:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37756

> Implement Groupby.sem
> -
>
> Key: SPARK-40305
> URL: https://issues.apache.org/jira/browse/SPARK-40305
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40305) Implement Groupby.sem

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40305:


Assignee: Apache Spark

> Implement Groupby.sem
> -
>
> Key: SPARK-40305
> URL: https://issues.apache.org/jira/browse/SPARK-40305
> Project: Spark
>  Issue Type: Improvement
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40149:


Assignee: Apache Spark

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Assignee: Apache Spark
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598938#comment-17598938
 ] 

Apache Spark commented on SPARK-40149:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37758

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598939#comment-17598939
 ] 

Apache Spark commented on SPARK-40149:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37758

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40149:


Assignee: (was: Apache Spark)

> Star expansion after outer join asymmetrically includes joining key
> ---
>
> Key: SPARK-40149
> URL: https://issues.apache.org/jira/browse/SPARK-40149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>Reporter: Otakar Truněček
>Priority: Blocker
>
> When star expansion is used on the left side of a join, the result includes the 
> joining key, while on the right side of the join it doesn't. I would expect the 
> behaviour to be symmetric (either included on both sides or on neither). 
> Example:
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> spark = SparkSession.builder.getOrCreate()
> df_left = spark.range(5).withColumn('val', f.lit('left'))
> df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
> df_merged = (
> df_left
> .alias('left')
> .join(df_right.alias('right'), on='id', how='full_outer')
> .withColumn('left_all', f.struct('left.*'))
> .withColumn('right_all', f.struct('right.*'))
> )
> df_merged.show()
> {code}
> result:
> {code:java}
> +---+----+-----+------------+---------+
> | id| val|  val|    left_all|right_all|
> +---+----+-----+------------+---------+
> |  0|left| null|   {0, left}|   {null}|
> |  1|left| null|   {1, left}|   {null}|
> |  2|left| null|   {2, left}|   {null}|
> |  3|left|right|   {3, left}|  {right}|
> |  4|left|right|   {4, left}|  {right}|
> |  5|null|right|{null, null}|  {right}|
> |  6|null|right|{null, null}|  {right}|
> +---+----+-----+------------+---------+
> {code}
> This behaviour started with release 3.2.0. Previously the key was not 
> included on either side. 
> Result from Spark 3.1.3
> {code:java}
> +---+----+-----+--------+---------+
> | id| val|  val|left_all|right_all|
> +---+----+-----+--------+---------+
> |  0|left| null|  {left}|   {null}|
> |  6|null|right|  {null}|  {right}|
> |  5|null|right|  {null}|  {right}|
> |  1|left| null|  {left}|   {null}|
> |  3|left|right|  {left}|  {right}|
> |  2|left| null|  {left}|   {null}|
> |  4|left|right|  {left}|  {right}|
> +---+----+-----+--------+---------+ {code}
> I have a gut feeling this is related to these issues:
> https://issues.apache.org/jira/browse/SPARK-39376
> https://issues.apache.org/jira/browse/SPARK-34527
> https://issues.apache.org/jira/browse/SPARK-38603
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval

2022-09-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-40279:


Assignee: Luca Canali

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> the interval between reports of the current Spark 
> job status in cluster mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40279) Document spark.yarn.report.interval

2022-09-01 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-40279.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37731
[https://github.com/apache/spark/pull/37731]

> Document spark.yarn.report.interval
> ---
>
> Key: SPARK-40279
> URL: https://issues.apache.org/jira/browse/SPARK-40279
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.4.0
>
>
> This proposes to document the configuration parameter 
> spark.yarn.report.interval -> the interval between reports of the current Spark 
> job status in cluster mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2022-09-01 Thread Lukas Waldmann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599013#comment-17599013
 ] 

Lukas Waldmann commented on SPARK-36862:


I managed to reproduce the issue in my environment. The problem is on line 192: a 
variable name in the function header has an array index.

Here is the generated code
{code:java}
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage636(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=636
/* 006 */ final class GeneratedIteratorForCodegenStage636 extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */ private Object[] references;
/* 008 */ private scala.collection.Iterator[] inputs;
/* 009 */ private scala.collection.Iterator smj_leftInput_0;
/* 010 */ private scala.collection.Iterator smj_rightInput_0;
/* 011 */ private InternalRow smj_leftRow_0;
/* 012 */ private InternalRow smj_rightRow_0;
/* 013 */ private boolean smj_globalIsNull_0;
/* 014 */ private boolean smj_globalIsNull_1;
/* 015 */ private double smj_value_27;
/* 016 */ private 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0;
/* 017 */ private double smj_value_28;
/* 018 */ private boolean smj_isNull_25;
/* 019 */ private boolean smj_isNull_26;
/* 020 */ private boolean smj_isNull_27;
/* 021 */ private boolean smj_isNull_28;
/* 022 */ private boolean smj_isNull_29;
/* 023 */ private boolean smj_isNull_30;
/* 024 */ private boolean project_subExprIsNull_0;
/* 025 */ private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] 
smj_mutableStateArray_2 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
/* 026 */ private java.util.regex.Pattern[] project_mutableStateArray_0 = new 
java.util.regex.Pattern[1];
/* 027 */ private Decimal[] smj_mutableStateArray_1 = new Decimal[1];
/* 028 */ private String[] project_mutableStateArray_1 = new String[1];
/* 029 */ private UTF8String[] smj_mutableStateArray_0 = new UTF8String[7];
/* 030 */
/* 031 */ public GeneratedIteratorForCodegenStage636(Object[] references) {
/* 032 */ this.references = references;
/* 033 */ }
/* 034 */
/* 035 */ public void init(int index, scala.collection.Iterator[] inputs) {
/* 036 */ partitionIndex = index;
/* 037 */ this.inputs = inputs;
/* 038 */ smj_leftInput_0 = inputs[0];
/* 039 */ smj_rightInput_0 = inputs[1];
/* 040 */
/* 041 */ smj_matches_0 = new 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 
2147483647);
/* 042 */ smj_mutableStateArray_2[0] = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192);
/* 043 */ smj_mutableStateArray_2[1] = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192);
/* 044 */
/* 045 */ }
/* 046 */
/* 047 */ private boolean smj_findNextOuterJoinRows_0(
/* 048 */ scala.collection.Iterator leftIter,
/* 049 */ scala.collection.Iterator rightIter) {
/* 050 */ smj_leftRow_0 = null;
/* 051 */ int comp = 0;
/* 052 */ while (smj_leftRow_0 == null) {
/* 053 */ if (!leftIter.hasNext()) return false;
/* 054 */ smj_leftRow_0 = (InternalRow) leftIter.next();
/* 055 */ UTF8String smj_value_22 = smj_If_0(smj_leftRow_0);
/* 056 */ boolean smj_isNull_2 = smj_globalIsNull_1;
/* 057 */ double smj_value_2 = -1.0;
/* 058 */ if (!smj_globalIsNull_1) {
/* 059 */ final String smj_doubleStr_0 = smj_value_22.toString();
/* 060 */ try {
/* 061 */ smj_value_2 = Double.valueOf(smj_doubleStr_0);
/* 062 */ } catch (java.lang.NumberFormatException e) {
/* 063 */ final Double d = (Double) 
Cast.processFloatingPointSpecialLiterals(smj_doubleStr_0, false);
/* 064 */ if (d == null) {
/* 065 */ smj_isNull_2 = true;
/* 066 */ } else {
/* 067 */ smj_value_2 = d.doubleValue();
/* 068 */ }
/* 069 */ }
/* 070 */ }
/* 071 */ boolean smj_isNull_1 = smj_isNull_2;
/* 072 */ double smj_value_1 = -1.0;
/* 073 */
/* 074 */ if (!smj_isNull_2) {
/* 075 */ if (Double.isNaN(smj_value_2)) {
/* 076 */ smj_value_1 = Double.NaN;
/* 077 */ } else if (smj_value_2 == -0.0d) {
/* 078 */ smj_value_1 = 0.0d;
/* 079 */ } else {
/* 080 */ smj_value_1 = smj_value_2;
/* 081 */ }
/* 082 */
/* 083 */ }
/* 084 */ if (smj_isNull_1) {
/* 085 */ if (!smj_matches_0.isEmpty()) {
/* 086 */ smj_matches_0.clear();
/* 087 */ }
/* 088 */ return true;
/* 089 */ }
/* 090 */ if (!smj_matches_0.isEmpty()) {
/* 091 */ comp = 0;
/* 092 */ if (comp == 0) {
/* 093 */ comp = 
org.apache.spark.sql.catalyst.util.SQLOrderingUtil.compareDoubles(smj_value_1, 
smj_value_28);
/* 094 */ }
/* 095 */
/* 096 */ if (comp == 0) {
/* 097 */ return true;
/* 098 */ }
/* 099 */ smj_matches_0.clear();
/* 100 */ }
/* 101 */
/* 102 */ do {
/* 103 */ if (smj_rightRow_0 == null) {
/* 104 */ if (!rightIter.hasNext()) {
/* 105 */ if (!smj_matches_0.isEmpty()) {
/* 106 */ smj_value_28 = smj_value_1;
/* 107 */ }
/* 108 */ return true;
/* 109 */ }
/* 110 */ smj_rightRow_0 = (InternalRow) 

[jira] [Comment Edited] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2022-09-01 Thread Lukas Waldmann (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599013#comment-17599013
 ] 

Lukas Waldmann edited comment on SPARK-36862 at 9/1/22 2:55 PM:


I managed to reproduce the issue in my environment. The problem is on line 192: 
the variable name in the function header contains an array index.

Here is the generated code:
{code:java}
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage636(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=636
/* 006 */ final class GeneratedIteratorForCodegenStage636 extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */ private Object[] references;
/* 008 */ private scala.collection.Iterator[] inputs;
/* 009 */ private scala.collection.Iterator smj_leftInput_0;
/* 010 */ private scala.collection.Iterator smj_rightInput_0;
/* 011 */ private InternalRow smj_leftRow_0;
/* 012 */ private InternalRow smj_rightRow_0;
/* 013 */ private boolean smj_globalIsNull_0;
/* 014 */ private boolean smj_globalIsNull_1;
/* 015 */ private double smj_value_27;
/* 016 */ private 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0;
/* 017 */ private double smj_value_28;
/* 018 */ private boolean smj_isNull_25;
/* 019 */ private boolean smj_isNull_26;
/* 020 */ private boolean smj_isNull_27;
/* 021 */ private boolean smj_isNull_28;
/* 022 */ private boolean smj_isNull_29;
/* 023 */ private boolean smj_isNull_30;
/* 024 */ private boolean project_subExprIsNull_0;
/* 025 */ private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] 
smj_mutableStateArray_2 = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
/* 026 */ private java.util.regex.Pattern[] project_mutableStateArray_0 = new 
java.util.regex.Pattern[1];
/* 027 */ private Decimal[] smj_mutableStateArray_1 = new Decimal[1];
/* 028 */ private String[] project_mutableStateArray_1 = new String[1];
/* 029 */ private UTF8String[] smj_mutableStateArray_0 = new UTF8String[7];
/* 030 */
/* 031 */ public GeneratedIteratorForCodegenStage636(Object[] references) {
/* 032 */ this.references = references;
/* 033 */ }
/* 034 */
/* 035 */ public void init(int index, scala.collection.Iterator[] inputs) {
/* 036 */ partitionIndex = index;
/* 037 */ this.inputs = inputs;
/* 038 */ smj_leftInput_0 = inputs[0];
/* 039 */ smj_rightInput_0 = inputs[1];
/* 040 */
/* 041 */ smj_matches_0 = new 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 
2147483647);
/* 042 */ smj_mutableStateArray_2[0] = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192);
/* 043 */ smj_mutableStateArray_2[1] = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192);
/* 044 */
/* 045 */ }
/* 046 */
/* 047 */ private boolean smj_findNextOuterJoinRows_0(
/* 048 */ scala.collection.Iterator leftIter,
/* 049 */ scala.collection.Iterator rightIter) {
/* 050 */ smj_leftRow_0 = null;
/* 051 */ int comp = 0;
/* 052 */ while (smj_leftRow_0 == null) {
/* 053 */ if (!leftIter.hasNext()) return false;
/* 054 */ smj_leftRow_0 = (InternalRow) leftIter.next();
/* 055 */ UTF8String smj_value_22 = smj_If_0(smj_leftRow_0);
/* 056 */ boolean smj_isNull_2 = smj_globalIsNull_1;
/* 057 */ double smj_value_2 = -1.0;
/* 058 */ if (!smj_globalIsNull_1) {
/* 059 */ final String smj_doubleStr_0 = smj_value_22.toString();
/* 060 */ try {
/* 061 */ smj_value_2 = Double.valueOf(smj_doubleStr_0);
/* 062 */ } catch (java.lang.NumberFormatException e) {
/* 063 */ final Double d = (Double) 
Cast.processFloatingPointSpecialLiterals(smj_doubleStr_0, false);
/* 064 */ if (d == null) {
/* 065 */ smj_isNull_2 = true;
/* 066 */ } else {
/* 067 */ smj_value_2 = d.doubleValue();
/* 068 */ }
/* 069 */ }
/* 070 */ }
/* 071 */ boolean smj_isNull_1 = smj_isNull_2;
/* 072 */ double smj_value_1 = -1.0;
/* 073 */
/* 074 */ if (!smj_isNull_2) {
/* 075 */ if (Double.isNaN(smj_value_2)) {
/* 076 */ smj_value_1 = Double.NaN;
/* 077 */ } else if (smj_value_2 == -0.0d) {
/* 078 */ smj_value_1 = 0.0d;
/* 079 */ } else {
/* 080 */ smj_value_1 = smj_value_2;
/* 081 */ }
/* 082 */
/* 083 */ }
/* 084 */ if (smj_isNull_1) {
/* 085 */ if (!smj_matches_0.isEmpty()) {
/* 086 */ smj_matches_0.clear();
/* 087 */ }
/* 088 */ return true;
/* 089 */ }
/* 090 */ if (!smj_matches_0.isEmpty()) {
/* 091 */ comp = 0;
/* 092 */ if (comp == 0) {
/* 093 */ comp = 
org.apache.spark.sql.catalyst.util.SQLOrderingUtil.compareDoubles(smj_value_1, 
smj_value_28);
/* 094 */ }
/* 095 */
/* 096 */ if (comp == 0) {
/* 097 */ return true;
/* 098 */ }
/* 099 */ smj_matches_0.clear();
/* 100 */ }
/* 101 */
/* 102 */ do {
/* 103 */ if (smj_rightRow_0 == null) {
/* 104 */ if (!rightIter.hasNext()) {
/* 105 */ if (!smj_matches_0.isEmpty()) {
/* 106 */ smj_value_28 = smj_value_1;
/* 107 */ }
/* 108 */ return true;
/*

[jira] [Created] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Wan Kun (Jira)
Wan Kun created SPARK-40306:
---

 Summary: Support more than Integer.MAX_VALUE of the same join key
 Key: SPARK-40306
 URL: https://issues.apache.org/jira/browse/SPARK-40306
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Wan Kun


For SMJ, if the number of records with the same join key in the right table is 
greater than Integer.MAX_VALUE, the result will be incorrect.
Before the SMJ join, the records with the same join key are buffered in an 
ExternalAppendOnlyUnsafeRowArray; an overflow of 
ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.

For example, from one of our production tables:

!image-2022-09-01-23-01-46-282.png!
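To make the overflow concrete, here is a minimal sketch (not from this report) that 
emulates the JVM's 32-bit signed int wrap-around in plain Python:
{code:python}
def to_int32(x: int) -> int:
    """Emulate a JVM 32-bit signed int (two's-complement wrap-around)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

INT_MAX = 2147483647  # java.lang.Integer.MAX_VALUE

# Counting one more buffered row with the same join key than an int can hold:
print(to_int32(INT_MAX + 1))  # -2147483648: a negative row count / start index
{code}
Once numRows or startIndex goes negative like this, the size checks and spill/seek 
logic can no longer behave correctly, which matches the OOM and wrong-result 
symptoms described above.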



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-40306:

Description: 
For SMJ, if the number of records with the same join key in the right table is 
greater than Integer.MAX_VALUE, the result will be incorrect.
Before the SMJ join, the records with the same join key are buffered in an 
ExternalAppendOnlyUnsafeRowArray; an overflow of 
ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.

For example, from one of our production tables:

!image-2022-09-01-23-02-15-955.png!

  was:
For SMJ, if the number of records with the same join key in the right table is 
greater than Integer.MAX_VALUE, the result will be incorrect.
Before the SMJ join, the records with the same join key are buffered in an 
ExternalAppendOnlyUnsafeRowArray; an overflow of 
ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.

For example, from one of our production tables:

!image-2022-09-01-23-01-46-282.png!


> Support more than Integer.MAX_VALUE of the same join key
> 
>
> Key: SPARK-40306
> URL: https://issues.apache.org/jira/browse/SPARK-40306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
> Attachments: image-2022-09-01-23-02-15-955.png
>
>
> For SMJ, if the number of records with the same join key in the right table is 
> greater than Integer.MAX_VALUE, the result will be incorrect.
> Before the SMJ join, the records with the same join key are buffered in an 
> ExternalAppendOnlyUnsafeRowArray; an overflow of 
> ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
> an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
> For example, from one of our production tables:
> !image-2022-09-01-23-02-15-955.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-40306:

Attachment: image-2022-09-01-23-02-15-955.png

> Support more than Integer.MAX_VALUE of the same join key
> 
>
> Key: SPARK-40306
> URL: https://issues.apache.org/jira/browse/SPARK-40306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
> Attachments: image-2022-09-01-23-02-15-955.png
>
>
> For SMJ, if the number of records with the same join key in the right table is 
> greater than Integer.MAX_VALUE, the result will be incorrect.
> Before the SMJ join, the records with the same join key are buffered in an 
> ExternalAppendOnlyUnsafeRowArray; an overflow of 
> ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
> an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
> For example, from one of our production tables:
> !image-2022-09-01-23-01-46-282.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40306:


Assignee: Apache Spark

> Support more than Integer.MAX_VALUE of the same join key
> 
>
> Key: SPARK-40306
> URL: https://issues.apache.org/jira/browse/SPARK-40306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2022-09-01-23-02-15-955.png
>
>
> For SMJ, if the number of records with the same join key in the right table is 
> greater than Integer.MAX_VALUE, the result will be incorrect.
> Before the SMJ join, the records with the same join key are buffered in an 
> ExternalAppendOnlyUnsafeRowArray; an overflow of 
> ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
> an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
> For example, from one of our production tables:
> !image-2022-09-01-23-02-15-955.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599023#comment-17599023
 ] 

Apache Spark commented on SPARK-40306:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/37759

> Support more than Integer.MAX_VALUE of the same join key
> 
>
> Key: SPARK-40306
> URL: https://issues.apache.org/jira/browse/SPARK-40306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
> Attachments: image-2022-09-01-23-02-15-955.png
>
>
> For SMJ, if the number of records with the same join key in the right table is 
> greater than Integer.MAX_VALUE, the result will be incorrect.
> Before the SMJ join, the records with the same join key are buffered in an 
> ExternalAppendOnlyUnsafeRowArray; an overflow of 
> ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
> an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
> For example, from one of our production tables:
> !image-2022-09-01-23-02-15-955.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599021#comment-17599021
 ] 

Apache Spark commented on SPARK-40306:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/37759

> Support more than Integer.MAX_VALUE of the same join key
> 
>
> Key: SPARK-40306
> URL: https://issues.apache.org/jira/browse/SPARK-40306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
> Attachments: image-2022-09-01-23-02-15-955.png
>
>
> For SMJ, if the number of records with the same join key in the right table is 
> greater than Integer.MAX_VALUE, the result will be incorrect.
> Before the SMJ join, the records with the same join key are buffered in an 
> ExternalAppendOnlyUnsafeRowArray; an overflow of 
> ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
> an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
> For example, from one of our production tables:
> !image-2022-09-01-23-02-15-955.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40306:


Assignee: (was: Apache Spark)

> Support more than Integer.MAX_VALUE of the same join key
> 
>
> Key: SPARK-40306
> URL: https://issues.apache.org/jira/browse/SPARK-40306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
> Attachments: image-2022-09-01-23-02-15-955.png
>
>
> For SMJ, if the number of records with the same join key in the right table is 
> greater than Integer.MAX_VALUE, the result will be incorrect.
> Before the SMJ join, the records with the same join key are buffered in an 
> ExternalAppendOnlyUnsafeRowArray; an overflow of 
> ExternalAppendOnlyUnsafeRowArray.numRows may cause an OOM. During the SMJ join, 
> an overflow of SpillableArrayIterator.startIndex may produce an incorrect result.
> For example, from one of our production tables:
> !image-2022-09-01-23-02-15-955.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38404) Spark does not find CTE inside nested CTE

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599028#comment-17599028
 ] 

Apache Spark commented on SPARK-38404:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/37760

> Spark does not find CTE inside nested CTE
> -
>
> Key: SPARK-38404
> URL: https://issues.apache.org/jira/browse/SPARK-38404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
> Environment: Tested on:
>  * MacOS Monterrey 12.2.1 (21D62)
>  * python 3.9.10
>  * pip 22.0.3
>  * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and 
> 3.1.3 (SQL query works)
>Reporter: Joan Heredia Rius
>Assignee: Peter Toth
>Priority: Minor
> Fix For: 3.4.0
>
>
> Hello! 
> It seems that when CTEs are defined and used inside another CTE in Spark SQL, 
> Spark treats the inner reference to the CTE as a table or view, which is not 
> found, and then it fails with `Table or view not found: `
> h3. Steps to reproduce
>  # `pip install pyspark==3.2.0` (also happens with 3.2.1)
>  # start pyspark console by typing `pyspark` in the terminal
>  # Try to run the following SQL with `spark.sql(sql)`
>  
> {code:java}
> WITH mock_cte__users AS (
>   SELECT 1 AS id
> ),
> model_under_test AS (
>   WITH users AS (
>     SELECT *
>     FROM mock_cte__users
>   )
>   SELECT *
>   FROM users
> )
> SELECT *
> FROM model_under_test;{code}
> Spark will fail with 
>  
> {code:java}
> pyspark.sql.utils.AnalysisException: Table or view not found: 
> mock_cte__users; line 8 pos 29; {code}
> I don't know if this is a regression or an expected behavior of the new 3.2.* 
> versions. This fix introduced in 3.2.0 might be related: 
> https://issues.apache.org/jira/browse/SPARK-36447
>  
>  
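For convenience, the steps above can be collapsed into one self-contained script (a 
sketch; it assumes a local pyspark==3.2.x installation as described in the 
environment):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

sql = """
WITH mock_cte__users AS (
  SELECT 1 AS id
),
model_under_test AS (
  WITH users AS (
    SELECT * FROM mock_cte__users
  )
  SELECT * FROM users
)
SELECT * FROM model_under_test
"""

# On 3.2.0/3.2.1 this raises AnalysisException: Table or view not found: mock_cte__users
# On 3.0.1/3.1.3 it returns a single row with id = 1
spark.sql(sql).show()
{code}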



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38404) Spark does not find CTE inside nested CTE

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599027#comment-17599027
 ] 

Apache Spark commented on SPARK-38404:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/37760

> Spark does not find CTE inside nested CTE
> -
>
> Key: SPARK-38404
> URL: https://issues.apache.org/jira/browse/SPARK-38404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
> Environment: Tested on:
>  * MacOS Monterrey 12.2.1 (21D62)
>  * python 3.9.10
>  * pip 22.0.3
>  * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and 
> 3.1.3 (SQL query works)
>Reporter: Joan Heredia Rius
>Assignee: Peter Toth
>Priority: Minor
> Fix For: 3.4.0
>
>
> Hello! 
> It seems that when CTEs are defined and used inside another CTE in Spark SQL, 
> Spark treats the inner reference to the CTE as a table or view, which is not 
> found, and then it fails with `Table or view not found: `
> h3. Steps to reproduce
>  # `pip install pyspark==3.2.0` (also happens with 3.2.1)
>  # start pyspark console by typing `pyspark` in the terminal
>  # Try to run the following SQL with `spark.sql(sql)`
>  
> {code:java}
> WITH mock_cte__users AS (
>   SELECT 1 AS id
> ),
> model_under_test AS (
>   WITH users AS (
>     SELECT *
>     FROM mock_cte__users
>   )
>   SELECT *
>   FROM users
> )
> SELECT *
> FROM model_under_test;{code}
> Spark will fail with 
>  
> {code:java}
> pyspark.sql.utils.AnalysisException: Table or view not found: 
> mock_cte__users; line 8 pos 29; {code}
> I don't know if this is a regression or an expected behavior of the new 3.2.* 
> versions. This fix introduced in 3.2.0 might be related: 
> https://issues.apache.org/jira/browse/SPARK-36447
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40302:
-

Assignee: Dongjoon Hyun

> Add YuniKornSuite
> -
>
> Key: SPARK-40302
> URL: https://issues.apache.org/jira/browse/SPARK-40302
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40302.
---
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37753
[https://github.com/apache/spark/pull/37753]

> Add YuniKornSuite
> -
>
> Key: SPARK-40302
> URL: https://issues.apache.org/jira/browse/SPARK-40302
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40304) Add decomTestTag to K8s Integration Test

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40304.
---
Fix Version/s: 3.4.0
   3.3.1
 Assignee: Dongjoon Hyun
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/37755

> Add decomTestTag to K8s Integration Test
> 
>
> Key: SPARK-40304
> URL: https://issues.apache.org/jira/browse/SPARK-40304
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.4.0, 3.3.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40302) Add YuniKornSuite

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40302:
--
Parent: SPARK-36057
Issue Type: Sub-task  (was: Test)

> Add YuniKornSuite
> -
>
> Key: SPARK-40302
> URL: https://issues.apache.org/jira/browse/SPARK-40302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-01 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-39996:

Summary: Upgrade postgresql to 42.5.0  (was: Upgrade postgresql to 42.4.1)

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39996:


Assignee: Apache Spark

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599103#comment-17599103
 ] 

Apache Spark commented on SPARK-39996:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/37762

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39996:


Assignee: (was: Apache Spark)

> Upgrade postgresql to 42.5.0
> 
>
> Key: SPARK-39996
> URL: https://issues.apache.org/jira/browse/SPARK-39996
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Security
> - fix: 
> [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197]
>  Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so 
> as to prevent SQL injection.
>   - Previously, the column names for both key and data columns in the table 
> were copied as-is into the generated
>   SQL. This allowed a malicious table with column names that include 
> statement terminator to be parsed and
>   executed as multiple separate commands.
>   - Also adds a new test class ResultSetRefreshTest to verify this change.
>   - Reported by [Sho Kato](https://github.com/kato-sho)
> [Release 
> note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37809) Add yunikorn feature step

2022-09-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved SPARK-37809.
-
Resolution: Won't Do

Based on the feedback from the community, there is no need to add the extra 
feature step: YuniKorn is able to work with the current version of Spark 
without any code changes. Please see the doc tracked in 
https://issues.apache.org/jira/browse/SPARK-40187.

> Add yunikorn feature step
> -
>
> Key: SPARK-37809
> URL: https://issues.apache.org/jira/browse/SPARK-37809
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Weiwei Yang
>Priority: Major
>
> Based on the design doc 
> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg.
>  This task is to add YuniKornFeatureStep in order to integrate with YuniKorn.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38310) Support job queue in YuniKorn feature step

2022-09-01 Thread Weiwei Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang resolved SPARK-38310.
-
Resolution: Won't Do

YuniKorn will follow the standard Spark API for the integration; there is no 
need to add code changes to support YuniKorn queues. 

> Support job queue in YuniKorn feature step
> --
>
> Key: SPARK-38310
> URL: https://issues.apache.org/jira/browse/SPARK-38310
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: Like SPARK-38188, yunikorn needs to support the queue 
> property and use that in the feature step to properly configure 
> driver/executor pods.
>Reporter: Weiwei Yang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40307) Optimize (De)Serialization of Python UDF

2022-09-01 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-40307:


 Summary: Optimize (De)Serialization of Python UDF
 Key: SPARK-40307
 URL: https://issues.apache.org/jira/browse/SPARK-40307
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


Python user-defined functions (UDFs) enable users to run arbitrary code against 
PySpark columns. They use Pickle for (de)serialization and execute row by row.

One major performance bottleneck of Python UDFs is (de)serialization, that is, 
the data interchange between the worker JVM and the spawned Python subprocess 
that actually executes the UDF. We should look for an alternative for the 
(de)serialization: Arrow, which is already used for (de)serialization of Pandas UDFs.
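As a rough illustration (a sketch, not part of this ticket; it assumes a local 
SparkSession with pandas and pyarrow installed), compare a pickled row-at-a-time 
UDF with an Arrow-backed pandas UDF:
{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000)

@udf(returnType=LongType())        # rows are pickled and processed one by one
def plus_one(x):
    return x + 1

@pandas_udf(LongType())            # batches are exchanged via Arrow
def plus_one_arrow(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one("id")).count()        # pickle path
df.select(plus_one_arrow("id")).count()  # Arrow path
{code}
The gap between the two paths is essentially the (de)serialization cost this 
umbrella targets.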



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599160#comment-17599160
 ] 

Dongjoon Hyun edited comment on SPARK-33605 at 9/1/22 9:33 PM:
---

I made a PR and had a discussion, but this issue is closed based on the 
following.
bq. note that the gcs connector (at least the builds off their master) are 
java 11 only; not sure where that stands w.r.t older releases

Apache Spark community has no plan to drop Java 8 yet.


was (Author: dongjoon):
I made a PR and had a discussion, but this issue is closed based on the 
following.
> note that the gcs connector (at least the builds off their master) are java 
> 11 only; not sure where that stands w.r.t older releases

Apache Spark community has no plan to drop Java 8 yet.

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3; for GCS to work, users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.
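For context, the manual setup the description refers to usually looks roughly like 
the sketch below; the connector coordinates, version, and bucket are illustrative 
assumptions, not something prescribed by this ticket:
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Assumed coordinates/version; check the GCS connector docs for the exact artifact.
    .config("spark.jars.packages",
            "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .getOrCreate()
)

df = spark.read.parquet("gs://some-bucket/some/path")  # hypothetical bucket
{code}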



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33605.
---
Resolution: Won't Do

I made a PR and had a discussion, but this issue is closed based on the 
following.
> note that the gcs connector (at least the builds off their master) are java 
> 11 only; not sure where that stands w.r.t older releases

Apache Spark community has no plan to drop Java 8 yet.

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3; for GCS to work, users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-33605:
---

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3; for GCS to work, users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33605:


Assignee: (was: Apache Spark)

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3; for GCS to work, users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33605:


Assignee: Apache Spark

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Assignee: Apache Spark
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3; for GCS to work, users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3

2022-09-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599162#comment-17599162
 ] 

Dongjoon Hyun commented on SPARK-33605:
---

{{My bad. It was Java 8.}}

{{- 
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/8453ce7ce7510e983bae7470909fbd02704c0539/pom.xml#L76-L77}}
{quote}{{8}}
{{8}}
{quote}

> Add GCS FS/connector config (dependencies?) akin to S3
> --
>
> Key: SPARK-33605
> URL: https://issues.apache.org/jira/browse/SPARK-33605
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Rafal Wojdyla
>Priority: Major
>
> Spark comes with some S3 batteries included, which makes it easier to use 
> with S3; for GCS to work, users are required to manually configure the jars. 
> This is especially problematic for python users who may not be accustomed to 
> java dependencies etc. This is an example of workaround for pyspark: 
> [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the 
> [GCS 
> connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage],
>  it would make things easier for GCS users.
> Please let me know what you think.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40308) str_to_map should accept non-foldable delimiter parameters

2022-09-01 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-40308:
-

 Summary: str_to_map should accept non-foldable delimiter parameters
 Key: SPARK-40308
 URL: https://issues.apache.org/jira/browse/SPARK-40308
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Bruce Robbins


Currently, str_to_map requires the delimiter parameters to be foldable 
expressions. For example, the following doesn't work in Spark SQL:
{noformat}
drop table if exists maptbl;
create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as str;
select str, str_to_map(str, del1, del2) from maptbl;
{noformat}
You get the following error:
{noformat}
str_to_map's delimiters must be foldable.; line 1 pos 12;
{noformat}
However, the above example SQL statements do work in Hive 2.3.9. There, you get:
{noformat}
+--++
| str  |_c1 |
+--++
| a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
| a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
+--++
2 rows selected (0.13 seconds)
{noformat}
It's unlikely that an input table would have the needed delimiters in columns. 
The use-case is more likely to be something like this, where the delimiters are 
determined based on some other value:
{noformat}
select
  str,
  str_to_map(str, ',', if(region = 0, ':', '#')) as m
from
  maptbl2;
{noformat}
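The same restriction can be seen quickly from PySpark (a sketch assuming a local 
session; the second select is the case this ticket proposes to support):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("a:1,b:2,c:3", ",", ":"), ("a-1%b-2%c-3", "%", "-")],
    ["str", "del1", "del2"],
)

# Literal (foldable) delimiters: works today.
df.select(expr("str_to_map(str, ',', ':')").alias("m")).show(truncate=False)

# Column (non-foldable) delimiters: currently rejected with
# "str_to_map's delimiters must be foldable".
df.select(expr("str_to_map(str, del1, del2)").alias("m")).show(truncate=False)
{code}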



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments

2022-09-01 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40308:
--
Summary: str_to_map should accept non-foldable delimiter arguments  (was: 
str_to_map should accept non-foldable delimiter parameters)

> str_to_map should accept non-foldable delimiter arguments
> -
>
> Key: SPARK-40308
> URL: https://issues.apache.org/jira/browse/SPARK-40308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> Currently, str_to_map requires the delimiter parameters to be foldable 
> expressions. For example, the following doesn't work in Spark SQL:
> {noformat}
> drop table if exists maptbl;
> create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
> insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as 
> str;
> select str, str_to_map(str, del1, del2) from maptbl;
> {noformat}
> You get the following error:
> {noformat}
> str_to_map's delimiters must be foldable.; line 1 pos 12;
> {noformat}
> However, the above example SQL statements do work in Hive 2.3.9. There, you 
> get:
> {noformat}
> +--++
> | str  |_c1 |
> +--++
> | a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
> | a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
> +--++
> 2 rows selected (0.13 seconds)
> {noformat}
> It's unlikely that an input table would have the needed delimiters in 
> columns. The use-case is more likely to be something like this, where the 
> delimiters are determined based on some other value:
> {noformat}
> select
>   str,
>   str_to_map(str, ',', if(region = 0, ':', '#')) as m
> from
>   maptbl2;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments

2022-09-01 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-40308:
--
Description: 
Currently, str_to_map requires the delimiter arguments to be foldable 
expressions. For example, the following doesn't work in Spark SQL:
{noformat}
drop table if exists maptbl;
create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as str;
select str, str_to_map(str, del1, del2) from maptbl;
{noformat}
You get the following error:
{noformat}
str_to_map's delimiters must be foldable.; line 1 pos 12;
{noformat}
However, the above example SQL statements do work in Hive 2.3.9. There, you get:
{noformat}
+--++
| str  |_c1 |
+--++
| a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
| a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
+--++
2 rows selected (0.13 seconds)
{noformat}
It's unlikely that an input table would have the needed delimiters in columns. 
The use-case is more likely to be something like this, where the delimiters are 
determined based on some other value:
{noformat}
select
  str,
  str_to_map(str, ',', if(region = 0, ':', '#')) as m
from
  maptbl2;
{noformat}

  was:
Currently, str_to_map requires the delimiter parameters to be foldable 
expressions. For example, the following doesn't work in Spark SQL:
{noformat}
drop table if exists maptbl;
create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as str;
select str, str_to_map(str, del1, del2) from maptbl;
{noformat}
You get the following error:
{noformat}
str_to_map's delimiters must be foldable.; line 1 pos 12;
{noformat}
However, the above example SQL statements do work in Hive 2.3.9. There, you get:
{noformat}
+--++
| str  |_c1 |
+--++
| a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
| a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
+--++
2 rows selected (0.13 seconds)
{noformat}
It's unlikely that an input table would have the needed delimiters in columns. 
The use-case is more likely to be something like this, where the delimiters are 
determined based on some other value:
{noformat}
select
  str,
  str_to_map(str, ',', if(region = 0, ':', '#')) as m
from
  maptbl2;
{noformat}


> str_to_map should accept non-foldable delimiter arguments
> -
>
> Key: SPARK-40308
> URL: https://issues.apache.org/jira/browse/SPARK-40308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> Currently, str_to_map requires the delimiter arguments to be foldable 
> expressions. For example, the following doesn't work in Spark SQL:
> {noformat}
> drop table if exists maptbl;
> create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
> insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as 
> str;
> select str, str_to_map(str, del1, del2) from maptbl;
> {noformat}
> You get the following error:
> {noformat}
> str_to_map's delimiters must be foldable.; line 1 pos 12;
> {noformat}
> However, the above example SQL statements do work in Hive 2.3.9. There, you 
> get:
> {noformat}
> +--++
> | str  |_c1 |
> +--++
> | a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
> | a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
> +--++
> 2 rows selected (0.13 seconds)
> {noformat}
> It's unlikely that an input table would have the needed delimiters in 
> columns. The use-case is more likely to be something like this, where the 
> delimiters are determined based on some other value:
> {noformat}
> select
>   str,
>   str_to_map(str, ',', if(region = 0, ':', '#')) as m
> from
>   maptbl2;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql

2022-09-01 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-40309:


 Summary: Introduce sql_conf context manager for pyspark.sql
 Key: SPARK-40309
 URL: https://issues.apache.org/jira/browse/SPARK-40309
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


[https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]

In Pandas API on Spark (see the link above), a context manager is introduced to 
set the Spark SQL configuration and then restore it when it exits.

That simplifies controlling the Spark SQL configuration, 

from
{code:java}
original_value = spark.conf.get("key")
spark.conf.set("key", "value")
...
spark.conf.set("key", original_value){code}
to

 

 
{code:java}
with sql_conf({"key": "value"}):
...
{code}
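A minimal standalone sketch of such a helper (the name and signature are 
illustrative; the eventual pyspark.sql API may differ):
{code:python}
from contextlib import contextmanager

@contextmanager
def sql_conf(pairs, *, spark):
    """Temporarily set Spark SQL configs and restore the previous values on exit."""
    keys = list(pairs)
    old_values = [spark.conf.get(k, None) for k in keys]
    for k in keys:
        spark.conf.set(k, pairs[k])
    try:
        yield
    finally:
        for k, old in zip(keys, old_values):
            if old is None:
                spark.conf.unset(k)
            else:
                spark.conf.set(k, old)

# Usage, assuming an existing SparkSession `spark`:
# with sql_conf({"spark.sql.shuffle.partitions": "8"}, spark=spark):
#     ...
{code}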
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql

2022-09-01 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40309:
-
Description: 
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]

In Pandas API on Spark (see the link above), a context manager is introduced to 
set the Spark SQL configuration and then restore it when it exits.

That simplifies controlling the Spark SQL configuration, 

from
{code:java}
original_value = spark.conf.get("key")
spark.conf.set("key", "value")
...
spark.conf.set("key", original_value){code}
to
{code:java}
with sql_conf({"key": "value"}):
...
{code}
 

 

  was:
[https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]

In Pandas API on Spark (see the link above), a context manager is introduced to 
set the Spark SQL configuration and then restore it when it exits.

That simplifies controlling the Spark SQL configuration, 

from
{code:java}
original_value = spark.conf.get("key")
spark.conf.set("key", "value")
...
spark.conf.set("key", original_value){code}
to

 

 
{code:java}
with sql_conf({"key": "value"}):
...
{code}
 

 


> Introduce sql_conf context manager for pyspark.sql
> --
>
> Key: SPARK-40309
> URL: https://issues.apache.org/jira/browse/SPARK-40309
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]
> In Pandas API on Spark (see the link above), a context manager is introduced to
> set the Spark SQL configuration and then restore it when it exits.
> That simplifies controlling the Spark SQL configuration, 
> from
> {code:java}
> original_value = spark.conf.get("key")
> spark.conf.set("key", "value")
> ...
> spark.conf.set("key", original_value){code}
> to
> {code:java}
> with sql_conf({"key": "value"}):
> ...
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599174#comment-17599174
 ] 

Apache Spark commented on SPARK-40308:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/37763

> str_to_map should accept non-foldable delimiter arguments
> -
>
> Key: SPARK-40308
> URL: https://issues.apache.org/jira/browse/SPARK-40308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> Currently, str_to_map requires the delimiter arguments to be foldable 
> expressions. For example, the following doesn't work in Spark SQL:
> {noformat}
> drop table if exists maptbl;
> create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
> insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as 
> str;
> select str, str_to_map(str, del1, del2) from maptbl;
> {noformat}
> You get the following error:
> {noformat}
> str_to_map's delimiters must be foldable.; line 1 pos 12;
> {noformat}
> However, the above example SQL statements do work in Hive 2.3.9. There, you 
> get:
> {noformat}
> +--++
> | str  |_c1 |
> +--++
> | a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
> | a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
> +--++
> 2 rows selected (0.13 seconds)
> {noformat}
> It's unlikely that an input table would have the needed delimiters in 
> columns. The use-case is more likely to be something like this, where the 
> delimiters are determined based on some other value:
> {noformat}
> select
>   str,
>   str_to_map(str, ',', if(region = 0, ':', '#')) as m
> from
>   maptbl2;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40308:


Assignee: (was: Apache Spark)

> str_to_map should accept non-foldable delimiter arguments
> -
>
> Key: SPARK-40308
> URL: https://issues.apache.org/jira/browse/SPARK-40308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> Currently, str_to_map requires the delimiter arguments to be foldable 
> expressions. For example, the following doesn't work in Spark SQL:
> {noformat}
> drop table if exists maptbl;
> create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
> insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as 
> str;
> select str, str_to_map(str, del1, del2) from maptbl;
> {noformat}
> You get the following error:
> {noformat}
> str_to_map's delimiters must be foldable.; line 1 pos 12;
> {noformat}
> However, the above example SQL statements do work in Hive 2.3.9. There, you 
> get:
> {noformat}
> +--++
> | str  |_c1 |
> +--++
> | a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
> | a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
> +--++
> 2 rows selected (0.13 seconds)
> {noformat}
> It's unlikely that an input table would have the needed delimiters in 
> columns. The use-case is more likely to be something like this, where the 
> delimiters are determined based on some other value:
> {noformat}
> select
>   str,
>   str_to_map(str, ',', if(region = 0, ':', '#')) as m
> from
>   maptbl2;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40308:


Assignee: Apache Spark

> str_to_map should accept non-foldable delimiter arguments
> -
>
> Key: SPARK-40308
> URL: https://issues.apache.org/jira/browse/SPARK-40308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, str_to_map requires the delimiter arguments to be foldable 
> expressions. For example, the following doesn't work in Spark SQL:
> {noformat}
> drop table if exists maptbl;
> create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str;
> insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as 
> str;
> select str, str_to_map(str, del1, del2) from maptbl;
> {noformat}
> You get the following error:
> {noformat}
> str_to_map's delimiters must be foldable.; line 1 pos 12;
> {noformat}
> However, the above example SQL statements do work in Hive 2.3.9. There, you 
> get:
> {noformat}
> +--++
> | str  |_c1 |
> +--++
> | a:1,b:2,c:3  | {"a":"1","b":"2","c":"3"}  |
> | a-1%b-2%c-3  | {"a":"1","b":"2","c":"3"}  |
> +--++
> 2 rows selected (0.13 seconds)
> {noformat}
> It's unlikely that an input table would have the needed delimiters in 
> columns. The use-case is more likely to be something like this, where the 
> delimiters are determined based on some other value:
> {noformat}
> select
>   str,
>   str_to_map(str, ',', if(region = 0, ':', '#')) as m
> from
>   maptbl2;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35161) Error-handling SQL functions

2022-09-01 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-35161:
---
Epic Link: SPARK-35030  (was: SPARK-38783)

> Error-handling SQL functions
> 
>
> Key: SPARK-35161
> URL: https://issues.apache.org/jira/browse/SPARK-35161
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Create new error-handling versions of existing SQL functions/operators, which 
> return NULL if an overflow or error occurs. So that:
> 1. Users can finish queries without interruptions in ANSI mode.
> 2. Users can get NULLs instead of unreasonable results if overflow occurs 
> when ANSI mode is off.
> For example, the behavior of the following SQL operations is unreasonable:
> {code:java}
> 2147483647 + 2 => -2147483647
> CAST(2147483648L AS INT) => -2147483648
> {code}
> With the new safe version SQL functions:
> {code:java}
> TRY_ADD(2147483647, 2) => null
> TRY_CAST(2147483648L AS INT) => null
> {code}
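The same contrast is easy to check from PySpark (a hedged sketch; it assumes a release recent enough to ship `try_add`/`try_cast`):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With ANSI mode off the plain operator silently wraps around, while the
# try_* variants return NULL instead of an unreasonable value.
spark.sql("SELECT 2147483647 + 2 AS plain").show()                    # -2147483647
spark.sql("SELECT try_add(2147483647, 2) AS safe_add").show()         # NULL
spark.sql("SELECT try_cast(2147483648L AS INT) AS safe_cast").show()  # NULL
{code}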



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40310) try_sum() should throw exceptions from its child

2022-09-01 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-40310:
--

 Summary: try_sum() should throw exceptions from its child
 Key: SPARK-40310
 URL: https://issues.apache.org/jira/browse/SPARK-40310
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


Similar to [https://github.com/apache/spark/pull/37486] and 
[https://github.com/apache/spark/pull/37663,]  the errors from try_sum()'s 
child should be shown instead of ignored.
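A hypothetical sketch of the semantics this asks for (assuming `try_sum` is available, i.e. Spark 3.3+, and ANSI mode is on): the overflow below comes from the child expression `v + 1`, so it should surface as an error rather than be folded into a NULL result.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")

# The child of try_sum overflows; per this ticket the resulting
# ArithmeticException should be thrown instead of being ignored.
df = spark.range(1).selectExpr("9223372036854775807L AS v")
df.selectExpr("try_sum(v + 1)").show()
{code}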



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40310) try_sum() should throw the exceptions from its child

2022-09-01 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-40310:
---
Summary: try_sum() should throw the exceptions from its child  (was: 
try_sum() should throw exceptions from its child)

> try_sum() should throw the exceptions from its child
> 
>
> Key: SPARK-40310
> URL: https://issues.apache.org/jira/browse/SPARK-40310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Similar to [https://github.com/apache/spark/pull/37486] and 
> [https://github.com/apache/spark/pull/37663,]  the errors from try_sum()'s 
> child should be shown instead of ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40310) try_sum() should throw the exceptions from its child

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599185#comment-17599185
 ] 

Apache Spark commented on SPARK-40310:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37764

> try_sum() should throw the exceptions from its child
> 
>
> Key: SPARK-40310
> URL: https://issues.apache.org/jira/browse/SPARK-40310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Similar to [https://github.com/apache/spark/pull/37486] and 
> [https://github.com/apache/spark/pull/37663,]  the errors from try_sum()'s 
> child should be shown instead of ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40310) try_sum() should throw the exceptions from its child

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40310:


Assignee: Gengliang Wang  (was: Apache Spark)

> try_sum() should throw the exceptions from its child
> 
>
> Key: SPARK-40310
> URL: https://issues.apache.org/jira/browse/SPARK-40310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> Similar to [https://github.com/apache/spark/pull/37486] and 
> [https://github.com/apache/spark/pull/37663,]  the errors from try_sum()'s 
> child should be shown instead of ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40310) try_sum() should throw the exceptions from its child

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40310:


Assignee: Apache Spark  (was: Gengliang Wang)

> try_sum() should throw the exceptions from its child
> 
>
> Key: SPARK-40310
> URL: https://issues.apache.org/jira/browse/SPARK-40310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Similar to [https://github.com/apache/spark/pull/37486] and 
> [https://github.com/apache/spark/pull/37663,]  the errors from try_sum()'s 
> child should be shown instead of ignored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Santosh Pingale (Jira)
Santosh Pingale created SPARK-40311:
---

 Summary: Introduce withColumnsRenamed
 Key: SPARK-40311
 URL: https://issues.apache.org/jira/browse/SPARK-40311
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR, SQL
Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.0.3
Reporter: Santosh Pingale


Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
single command. This is mostly a performance-related optimisation for cases where 
users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
iterations, there are cases where either the driver blows up or users receive a 
StackOverflowError.
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
{col, f"prefix_{col}" for col in raw.columns}
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
g = datetime.datetime.now()
g-a
datetime.timedelta(microseconds=632116)
{code}
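For comparison, the usual workaround today is a single `select` with aliases (or `toDF`), which renames every column in one projection instead of stacking one plan node per column. A sketch, assuming the `raw` DataFrame from the snippets above:

{code:python}
from pyspark.sql import functions as F

# One projection renames all columns at once, so the logical plan stays flat.
renamed = raw.select([F.col(c).alias(f"prefix_{c}") for c in raw.columns])

# Equivalent one-liner when only renaming is needed.
renamed2 = raw.toDF(*[f"prefix_{c}" for c in raw.columns])
{code}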



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Santosh Pingale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-40311:

Description: 
Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
single command. This is mostly a performance-related optimisation for cases where 
users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
iterations, there are cases where either the driver blows up or users receive a 
StackOverflowError.
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
{col, f"prefix_{col}" for col in raw.columns}
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
g = datetime.datetime.now()
g-a
datetime.timedelta(microseconds=632116) {code}

  was:
Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
single command. This is mostly a performance-related optimisation for cases where 
users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
iterations, there are cases where either the driver blows up or users receive a 
StackOverflowError.
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
{col, f"prefix_{col}" for col in raw.columns}
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsR

[jira] [Assigned] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40311:


Assignee: Apache Spark

> Introduce withColumnsRenamed
> 
>
> Key: SPARK-40311
> URL: https://issues.apache.org/jira/browse/SPARK-40311
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Assignee: Apache Spark
>Priority: Minor
>
> Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
> single command. This is mostly a performance-related optimisation for cases where 
> users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
> iterations, there are cases where either the driver blows up or users receive a 
> StackOverflowError.
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> {col, f"prefix_{col}" for col in raw.columns}
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> b = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> c = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> d = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> e = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> f = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(seconds=12, microseconds=480021) {code}
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> b = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> c = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> d = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> e = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> f = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(microseconds=632116) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599191#comment-17599191
 ] 

Apache Spark commented on SPARK-40311:
--

User 'santosh-d3vpl3x' has created a pull request for this issue:
https://github.com/apache/spark/pull/37761

> Introduce withColumnsRenamed
> 
>
> Key: SPARK-40311
> URL: https://issues.apache.org/jira/browse/SPARK-40311
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Priority: Minor
>
> Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
> single command. This is mostly a performance-related optimisation for cases where 
> users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
> iterations, there are cases where either the driver blows up or users receive a 
> StackOverflowError.
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> {col, f"prefix_{col}" for col in raw.columns}
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> b = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> c = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> d = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> e = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> f = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(seconds=12, microseconds=480021) {code}
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> b = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> c = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> d = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> e = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> f = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(microseconds=632116) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Santosh Pingale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-40311:

Description: 
Add a scala, pyspark, R dataframe API that can rename multiple columns in a 
single command. This is mostly a performance related optimisations where users 
iteratively perform `withColumnRenamed`. With 100s columns and multiple 
iterations, there are cases where either driver will blow up or users will 
receive a StackOverflowError.
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
g = datetime.datetime.now()
g-a
datetime.timedelta(microseconds=632116) {code}

  was:
Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
single command. This is mostly a performance-related optimisation for cases where 
users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
iterations, there are cases where either the driver blows up or users receive a 
StackOverflowError.
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
{col, f"prefix_{col}" for col in raw.columns}
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}

[jira] [Assigned] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40311:


Assignee: (was: Apache Spark)

> Introduce withColumnsRenamed
> 
>
> Key: SPARK-40311
> URL: https://issues.apache.org/jira/browse/SPARK-40311
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Priority: Minor
>
> Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
> single command. This is mostly a performance-related optimisation for cases where 
> users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
> iterations, there are cases where either the driver blows up or users receive a 
> StackOverflowError.
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> {col, f"prefix_{col}" for col in raw.columns}
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> b = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> c = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> d = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> e = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> f = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(seconds=12, microseconds=480021) {code}
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> b = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> c = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> d = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> e = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> f = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(microseconds=632116) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599194#comment-17599194
 ] 

Apache Spark commented on SPARK-40311:
--

User 'santosh-d3vpl3x' has created a pull request for this issue:
https://github.com/apache/spark/pull/37761

> Introduce withColumnsRenamed
> 
>
> Key: SPARK-40311
> URL: https://issues.apache.org/jira/browse/SPARK-40311
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2
>Reporter: Santosh Pingale
>Priority: Minor
>
> Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
> single command. This is mostly a performance-related optimisation for cases where 
> users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
> iterations, there are cases where either the driver blows up or users receive a 
> StackOverflowError.
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> b = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> c = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> d = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> e = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> f = datetime.datetime.now()
> for col in raw.columns:
> raw = raw.withColumnRenamed(col, f"prefix_{col}")
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(seconds=12, microseconds=480021) {code}
> {code:java}
> import datetime
> import numpy as np
> import pandas as pd
> num_rows = 2
> num_columns = 100
> data = np.zeros((num_rows, num_columns))
> columns = map(str, range(num_columns))
> raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))
> a = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> b = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> c = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> d = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> e = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> f = datetime.datetime.now()
> raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
> raw.columns}), spark)
> g = datetime.datetime.now()
> g-a
> datetime.timedelta(microseconds=632116) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed

2022-09-01 Thread Santosh Pingale (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santosh Pingale updated SPARK-40311:

Description: 
Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
single command. Issues arise when users iteratively perform `withColumnRenamed`:
 * When it works, performance is slower
 * In some cases, a StackOverflowError is raised because the logical plan becomes too big
 * In a few cases, the driver has died due to memory consumption

Some reproducible benchmarks:
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
g = datetime.datetime.now()
g-a
datetime.timedelta(microseconds=632116) {code}

  was:
Add a Scala, PySpark, and R DataFrame API that can rename multiple columns in a 
single command. This is mostly a performance-related optimisation for cases where 
users iteratively perform `withColumnRenamed`. With hundreds of columns and multiple 
iterations, there are cases where either the driver blows up or users receive a 
StackOverflowError.
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

b = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

c = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

d = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

e = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

f = datetime.datetime.now()
for col in raw.columns:
raw = raw.withColumnRenamed(col, f"prefix_{col}")

g = datetime.datetime.now()
g-a
datetime.timedelta(seconds=12, microseconds=480021) {code}
{code:java}
import datetime
import numpy as np
import pandas as pd

num_rows = 2
num_columns = 100
data = np.zeros((num_rows, num_columns))
columns = map(str, range(num_columns))
raw = spark.createDataFrame(pd.DataFrame(data, columns=columns))

a = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
b = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
c = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
d = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
e = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spark)
f = datetime.datetime.now()
raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in 
raw.columns}), spar

[jira] [Updated] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql

2022-09-01 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-40309:

Labels: release-notes  (was: )

> Introduce sql_conf context manager for pyspark.sql
> --
>
> Key: SPARK-40309
> URL: https://issues.apache.org/jira/browse/SPARK-40309
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: release-notes
>
> [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]
> a context manager is introduced that sets the Spark SQL configuration and
> then restores it when it exits, in the Pandas API on Spark.
> That simplifies the control of Spark SQL configuration, 
> from
> {code:java}
> original_value = spark.conf.get("key")
> spark.conf.set("key", "value")
> ...
> spark.conf.set("key", original_value){code}
> to
> {code:java}
> with sql_conf({"key": "value"}):
> ...
> {code}
>  
>  
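A minimal sketch of such a context manager, assuming the active session is passed in (the real helper lives in `pyspark.pandas.utils` and is only being proposed for `pyspark.sql` here):

{code:python}
from contextlib import contextmanager

@contextmanager
def sql_conf(pairs, *, spark):
    """Temporarily set Spark SQL configs and restore the previous values on exit."""
    keys = list(pairs.keys())
    old_values = [spark.conf.get(k, None) for k in keys]
    for k, v in pairs.items():
        spark.conf.set(k, str(v))
    try:
        yield
    finally:
        for k, old in zip(keys, old_values):
            if old is None:
                spark.conf.unset(k)
            else:
                spark.conf.set(k, old)

# Usage:
# with sql_conf({"spark.sql.shuffle.partitions": 4}, spark=spark):
#     ...
{code}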



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40288:


Assignee: (was: Apache Spark)

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> insert overwrite table miss_expr
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> --error stack
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
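The failure is straightforward to reproduce from PySpark as well (a sketch; it assumes a Hive-enabled session because the table uses a Hive text format, and the last statement is the one expected to hit the `IllegalStateException` on the affected versions):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("create table miss_expr(id int, name string, age double) stored as textfile")
spark.sql("insert overwrite table miss_expr "
          "values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)")

# Fails during planning on the affected versions.
spark.sql("""
    insert overwrite table miss_expr
    select id, name, nage as n from (
        select id, name, if(age > 3, 100, 200) as nage
        from miss_expr group by id, name, age
    ) group by id, name, nage
""")
{code}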



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-01 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40288:


Assignee: Apache Spark

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Assignee: Apache Spark
>Priority: Minor
>
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> insert overwrite table miss_expr
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> --error stack
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599242#comment-17599242
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37765

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> insert overwrite table miss_expr
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> --error stack
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599243#comment-17599243
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37765

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> insert overwrite table miss_expr
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> --error stack
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599249#comment-17599249
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37766

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> insert overwrite table miss_expr
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> --error stack
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599251#comment-17599251
 ] 

Apache Spark commented on SPARK-40288:
--

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/37766

> After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should 
> applied to avoid attribute missing when use complex expression.
> --
>
> Key: SPARK-40288
> URL: https://issues.apache.org/jira/browse/SPARK-40288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
> Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0
>Reporter: hgs
>Priority: Minor
>
> --table
> create table miss_expr(id int, name string, age double) stored as textfile
> --data
> insert overwrite table miss_expr values (1,'ox',1.0),(1,'oox',2.0),(2,'ox',3.0),(2,'xxo',4.0)
> --failure sql
> insert overwrite table miss_expr
> select id, name, nage as n from (
>   select id, name, if(age > 3, 100, 200) as nage from miss_expr group by id, name, age
> ) group by id, name, nage
> --error stack
> Caused by: java.lang.IllegalStateException: Couldn't find age#4 in [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12]
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254
 ] 

Yang Jie commented on SPARK-40303:
--

Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
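For anyone reproducing this, the flag has to reach the JVMs through the usual `extraJavaOptions` settings (a sketch; the driver JVM is already running by the time builder configs apply, so its flag must be passed at launch time, e.g. via `--conf spark.driver.extraJavaOptions=-XX:+PrintCompilation` on spark-submit):

{code:python}
from pyspark.sql import SparkSession

# Executors are launched after the session is built, so this takes effect for
# them; look for "COMPILE SKIPPED" entries like the ones quoted above.
spark = (
    SparkSession.builder
    .config("spark.executor.extraJavaOptions", "-XX:+PrintCompilation")
    .getOrCreate()
)
{code}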

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254
 ] 

Yang Jie edited comment on SPARK-40303 at 9/2/22 4:49 AM:
--

Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
2?


was (Author: luciferyang):
Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39284) Implement Groupby.mad

2022-09-01 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599258#comment-17599258
 ] 

Apache Spark commented on SPARK-39284:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37767

> Implement Groupby.mad
> -
>
> Key: SPARK-39284
> URL: https://issues.apache.org/jira/browse/SPARK-39284
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254
 ] 

Yang Jie edited comment on SPARK-40303 at 9/2/22 4:56 AM:
--

Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
2 or give up?


was (Author: luciferyang):
Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
2?

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254
 ] 

Yang Jie edited comment on SPARK-40303 at 9/2/22 4:58 AM:
--

Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
2 or give up when using Java 8?


was (Author: luciferyang):
Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
2 or give up?

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254
 ] 

Yang Jie edited comment on SPARK-40303 at 9/2/22 4:59 AM:
--

Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
3 or give up when using Java 8?


was (Author: luciferyang):
Running with Java 8 and `-XX:+PrintCompilation`, I found the following logs:
{code:java}
64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable)
137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
 

[~rednaxelafx] In this case, will compilation optimization degenerate to Level 
2 or give up when using Java 8?

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599262#comment-17599262
 ] 

Yang Jie commented on SPARK-40303:
--

{code:java}
 64265 21820       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
  64308 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
  64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable) {code}
{code:java}
103941 22065 %     3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 104602 22067       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 (2712 bytes)
 135979 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
The relevant compilation logs when running on Java 8 are shown above.
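
A hedged note on the tier numbers in these log lines; the level mapping below describes general HotSpot tiered compilation rather than anything stated in this ticket:

{noformat}
HotSpot tiered-compilation levels:
  0 = interpreter
  1 = C1, no profiling
  2 = C1, basic profiling
  3 = C1, full profiling
  4 = C2
{noformat}

Read this way, the tier-4 compile of hashAgg_doAggregateWithKeys_0$ is skipped as "not retryable" on Java 8, so the hot loop appears to stay on tier-3 (profiled C1) code.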

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599263#comment-17599263
 ] 

Yang Jie commented on SPARK-40303:
--

If you run with Java 17, the performance gap will be smaller. The compilation 
logs for `hashAgg_doConsume_0` and `hashAgg_doAggregateWithKeys_0` are as follows:

 
{code:java}
 102158 22568       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
 102180 22606       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
 102228 22606       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (retry 
at different tier)
 102228 22619       1       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
 102240 22568       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   made not entrant
 218296 24067       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2::hashAgg_doConsume_0$
 (2052 bytes)
 218463 22568       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   made zombie {code}
 
{code:java}
105832 22708 %     3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 105955 22709       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 (2712 bytes)
 108247 22741 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 108484 22741 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (retry at 
different tier)
 108727 22708 %     3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   made not entrant
 108727 22743 %     1       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 218463 22708 %     3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   made zombie {code}
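
For anyone reproducing this interactively, a minimal sketch of the same with/without-codegen comparison in a plain spark-shell. withSQLConf in the description's snippet is a test-only helper, so using spark.conf.set here is an assumption about an equivalent setup; the path and column names come from the reproduction code above.

{code:scala}
// Hedged sketch: toggle whole-stage codegen at runtime and rerun the same kind of aggregation.
// Assumes the parquet data was already written by the benchmark code in the description.
val dir = "/tmp/spark/benchmark"

spark.conf.set("spark.sql.codegen.wholeStage", "true")
spark.read.parquet(dir)
  .selectExpr("count(distinct id1)", "count(distinct id2)")
  .write.format("noop").mode("Overwrite").save()

spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.read.parquet(dir)
  .selectExpr("count(distinct id1)", "count(distinct id2)")
  .write.format("noop").mode("Overwrite").save()
{code}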

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   

[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen

2022-09-01 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599262#comment-17599262
 ] 

Yang Jie edited comment on SPARK-40303 at 9/2/22 5:15 AM:
--

{code:java}
 64265 21820       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
  64308 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
  64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable) 
1735350 24101       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2::hashAgg_doConsume_0$
 (2052 bytes){code}
{code:java}
103941 22065 %     3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 104602 22067       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 (2712 bytes)
 135979 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
The relevant compilation logs when running on Java 8 are shown above.


was (Author: luciferyang):
{code:java}
 64265 21820       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
  64308 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)
  64350 21862       4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$
 (2051 bytes)   COMPILE SKIPPED: unsupported incoming calling sequence (not 
retryable) {code}
{code:java}
103941 22065 %     3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 104602 22067       3       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 (2712 bytes)
 135979 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)
 137245 22141 %     4       
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$
 @ 38 (2712 bytes)   COMPILE SKIPPED: unsupported calling sequence (not 
retryable) {code}
The relevant compilation logs when running on Java 8 are shown above.

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(m
