[jira] [Updated] (SPARK-40298) shuffle data recovery on the reused PVCs no effect
[ https://issues.apache.org/jira/browse/SPARK-40298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] todd updated SPARK-40298: - Priority: Blocker (was: Major) > shuffle data recovery on the reused PVCs no effect > --- > > Key: SPARK-40298 > URL: https://issues.apache.org/jira/browse/SPARK-40298 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.2 >Reporter: todd >Priority: Blocker > Attachments: 1662002808396.jpg, 1662002822097.jpg > > > I use spark3.2.2 to test the [ Support shuffle data recovery on the reused > PVCs (SPARK-35593) ] feature.I found that when shuffle read fails, data is > still read from source. > It can be confirmed that the pvc has been multiplexed by other pods, and the > Index and data data information has been sent > *This is my spark configuration information:* > --conf spark.driver.memory=5G > --conf spark.executor.memory=15G > --conf spark.executor.cores=1 > --conf spark.executor.instances=50 > --conf spark.sql.shuffle.partitions=50 > --conf spark.dynamicAllocation.enabled=false > --conf spark.kubernetes.driver.reusePersistentVolumeClaim=true > --conf spark.kubernetes.driver.ownPersistentVolumeClaim=true > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp2 > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=100Gi > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/tmp/data > --conf > spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false > --conf spark.executorEnv.SPARK_EXECUTOR_DIRS=/tmp/data > --conf > spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO > --conf spark.kubernetes.executor.missingPodDetectDelta=10s > --conf spark.kubernetes.executor.apiPollingInterval=10s > --conf spark.shuffle.io.retryWait=60s > --conf spark.shuffle.io.maxRetries=5 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40301) Add parameter validation in pyspark.rdd
Ruifeng Zheng created SPARK-40301: - Summary: Add parameter validation in pyspark.rdd Key: SPARK-40301 URL: https://issues.apache.org/jira/browse/SPARK-40301 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40301) Add parameter validation in pyspark.rdd
[ https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598784#comment-17598784 ] Apache Spark commented on SPARK-40301: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37752 > Add parameter validation in pyspark.rdd > --- > > Key: SPARK-40301 > URL: https://issues.apache.org/jira/browse/SPARK-40301 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40301) Add parameter validation in pyspark.rdd
[ https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40301: Assignee: (was: Apache Spark) > Add parameter validation in pyspark.rdd > --- > > Key: SPARK-40301 > URL: https://issues.apache.org/jira/browse/SPARK-40301 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40301) Add parameter validation in pyspark.rdd
[ https://issues.apache.org/jira/browse/SPARK-40301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40301: Assignee: Apache Spark > Add parameter validation in pyspark.rdd > --- > > Key: SPARK-40301 > URL: https://issues.apache.org/jira/browse/SPARK-40301 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39664) RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
[ https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598813#comment-17598813 ] Ruifeng Zheng commented on SPARK-39664: --- [~igaloly] _RowMatrix_ is in .mllib, while _Correlation_ is in .ml. The .mllib is implemented on top of RDD, while .ml on DataFrame. .mllib is in maintainance mode, and we update it only if there is a correctness issue. > RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...) > > > Key: SPARK-39664 > URL: https://issues.apache.org/jira/browse/SPARK-39664 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.2.1 >Reporter: igal l >Priority: Major > > I have a Pyspark DF with one column. This column type is Vector and the > values are DenseVectors of size 768. The DF has 1 million rows. > I want to calculate the Covariance matrix of this set of vectors. > When I try to calculate it with > `RowMatrix(df.rdd.map(list)).computeCovariance()`, it takes 1.57 minuts. > When I try to calculate the Correlation matrix with `Correlation.corr(df, > '_1')`, it takes 33 seconds. > Covariance and Correlation's formula are pretty much the same, therefore, I > don't understand the gap between them -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39664) RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...)
[ https://issues.apache.org/jira/browse/SPARK-39664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-39664. --- Resolution: Not A Problem > RowMatrix(...).computeCovariance() VS Correlation.corr(..., ...) > > > Key: SPARK-39664 > URL: https://issues.apache.org/jira/browse/SPARK-39664 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.2.1 >Reporter: igal l >Priority: Major > > I have a Pyspark DF with one column. This column type is Vector and the > values are DenseVectors of size 768. The DF has 1 million rows. > I want to calculate the Covariance matrix of this set of vectors. > When I try to calculate it with > `RowMatrix(df.rdd.map(list)).computeCovariance()`, it takes 1.57 minuts. > When I try to calculate the Correlation matrix with `Correlation.corr(df, > '_1')`, it takes 33 seconds. > Covariance and Correlation's formula are pretty much the same, therefore, I > don't understand the gap between them -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40302) Add YuniKornSuite
Dongjoon Hyun created SPARK-40302: - Summary: Add YuniKornSuite Key: SPARK-40302 URL: https://issues.apache.org/jira/browse/SPARK-40302 Project: Spark Issue Type: Test Components: Kubernetes, Tests Affects Versions: 3.4.0, 3.3.1 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40302) Add YuniKornSuite
[ https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598826#comment-17598826 ] Apache Spark commented on SPARK-40302: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/37753 > Add YuniKornSuite > - > > Key: SPARK-40302 > URL: https://issues.apache.org/jira/browse/SPARK-40302 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0, 3.3.1 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40302) Add YuniKornSuite
[ https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40302: Assignee: Apache Spark > Add YuniKornSuite > - > > Key: SPARK-40302 > URL: https://issues.apache.org/jira/browse/SPARK-40302 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0, 3.3.1 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40302) Add YuniKornSuite
[ https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40302: Assignee: (was: Apache Spark) > Add YuniKornSuite > - > > Key: SPARK-40302 > URL: https://issues.apache.org/jira/browse/SPARK-40302 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0, 3.3.1 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40303) The performance will be worse after codegen
Yuming Wang created SPARK-40303: --- Summary: The performance will be worse after codegen Key: SPARK-40303 URL: https://issues.apache.org/jira/browse/SPARK-40303 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:scala} import org.apache.spark.benchmark.Benchmark val dir = "/tmp/spark/benchmark" val N = 200 val columns = Range(0, 100).map(i => s"id % $i AS id$i") spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) Seq(60).foreach{ cnt => val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => s"count(distinct $c)") val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 1) benchmark.addCase(s"$cnt count distinct with codegen") { _ => withSQLConf( "spark.sql.codegen.wholeStage" -> "true", "spark.sql.codegen.factoryMode" -> "FALLBACK") { spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save() } } benchmark.addCase(s"$cnt count distinct without codegen") { _ => withSQLConf( "spark.sql.codegen.wholeStage" -> "false", "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { spark.read.parquet(dir).selectExpr(selectExps: _*).write.format("noop").mode("Overwrite").save() } } benchmark.run() } {code} {noformat} Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz Benchmark count distinct: Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative 60 count distinct with codegen 628146 628146 0 0.0 314072.8 1.0X 60 count distinct without codegen147635 147635 0 0.0 73817.5 4.3X {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598831#comment-17598831 ] Yuming Wang commented on SPARK-40303: - cc [~cloud_fan] [~joshrosen] [~rednaxelafx] > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3
[ https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598835#comment-17598835 ] Steve Loughran commented on SPARK-40287: does this happen when # you switch to an ASF spark build with the s3a connector # and use an s3a committer safe to use with spark this is clearly EMR (s3:// URLs), so they have to be the people to talk to if you can't replicate it in the apache code > Load Data using Spark by a single partition moves entire dataset under same > location in S3 > -- > > Key: SPARK-40287 > URL: https://issues.apache.org/jira/browse/SPARK-40287 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm experiencing an issue in PySpark when creating a hive table and loading > in the data to the table. So I'm using an Amazon s3 bucket as a data location > and I'm creating a table as parquet and trying to load data into that table > by a single partition, and I'm seeing some weird behavior. When selecting the > data location in s3 of a parquet file to load into my table. All of the data > is moved into the specified location in my create table command including the > partitions I didn't specify in the load data command. For example: > {code:java} > # create a data frame in pyspark with partitions > df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], > ["c1", "c2", "p"]) > # save it to S3 > df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/") > {code} > In the current state S3 should have a new folder `data` with two folders > which contain a parquet file in each partition. > > - s3://bucket/data/p=x/ > - part-1.snappy.parquet > - s3://bucket/data/p=y/ > - part-2.snappy.parquet > - part-3.snappy.parquet > > {code:java} > # create new table > spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) > STORED AS parquet LOCATION 's3://bucket/new/'") > # load the saved table data from s3 specifying single partition value x > spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION > (p='x')") > spark.sql("select * from src").show() > # output: > # +---+---+---+ > # | c1| c2| p| > # +---+---+---+ > # +---+---+---+ > {code} > After running the `load data` command, and looking at the table I'm left with > no data loaded in. When checking S3 the data source we saved earlier is moved > under `s3://bucket/new/` oddly enough it also brought over the other > partitions along with it directory structure listed below. > - s3://bucket/new/ > - p=x/ > - p=x/ > - part-1.snappy.parquet > - p=y/ > - part-2.snappy.parquet > - part-3.snappy.parquet > Is this the intended behavior of loading the data in from a partitioned > parquet file? Is the previous file supposed to be moved/deleted from source > directory? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file
[ https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598837#comment-17598837 ] Steve Loughran commented on SPARK-40286: this is EMR. can you repliacate in an ASF spark release through the s3a connector and committers? if you can replicate, especiallly in spark standadone, turn spark and org.apache.hadoop.fs.s3a logging on to debug and see what it says > Load Data from S3 deletes data source file > -- > > Key: SPARK-40286 > URL: https://issues.apache.org/jira/browse/SPARK-40286 > Project: Spark > Issue Type: Question > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Drew >Priority: Major > > Hello, > I'm using spark to [load > data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into > a hive table through Pyspark, and when I load data from a path in Amazon S3, > the original file is getting wiped from the Directory. The file is found, and > is populating the table with data. I also tried to add the `Local` clause but > that throws an error when looking for the file. When looking through the > documentation it doesn't explicitly state that this is the intended behavior. > Thanks in advance! > {code:java} > spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile") > spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE > src"){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
[ https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598845#comment-17598845 ] Apache Spark commented on SPARK-39906: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37754 > Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash > syntax instead' > -- > > Key: SPARK-39906 > URL: https://issues.apache.org/jira/browse/SPARK-39906 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > 1. > {code:java} > ./2_Run Build modules catalyst, > hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell > syntax is deprecated; use slash syntax instead: examples / Test / package, > avro / Test / package, network-shuffle / Test / package, sketch / Test / > package, unsafe / Test / package, launcher / Test / package, network-yarn / > Test / package, streaming / Test / package, catalyst / Test / package, > hive-thriftserver / Test / package, kvstore / Test / package, core / Test / > package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, > streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, > network-common / Test / package, sql / Test / package, > streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, > streaming-kinesis-asl / Test / package, docker-integration-tests / Test / > package, kubernetes / Test / package, yarn / Test / package, tags / Test / > package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, > mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / > package, tools / Test / package, mllib-local / Test / package, repl / Test / > package, sql-kafka-0-10 / Test / package, Test / package > ./2_Run Build modules catalyst, > hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell > syntax is deprecated; use slash syntax instead: examples / Test / package, > avro / Test / package, network-shuffle / Test / package, sketch / Test / > package, unsafe / Test / package, launcher / Test / package, network-yarn / > Test / package, streaming / Test / package, catalyst / Test / package, > hive-thriftserver / Test / package, kvstore / Test / package, core / Test / > package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, > streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, > network-common / Test / package, sql / Test / package, > streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, > streaming-kinesis-asl / Test / package, docker-integration-tests / Test / > package, kubernetes / Test / package, yarn / Test / package, tags / Test / > package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, > mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / > package, tools / Test / package, mllib-local / Test / package, repl / Test / > package, sql-kafka-0-10 / Test / package, Test / package > ./Run Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run > tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is > deprecated; use slash syntax instead: network-yarn / Test / package, > network-shuffle / Test / package, sketch / Test / package, yarn / Test / > package, sql / Test / package, core / Test / package, hive-thriftserver / > Test / package, ganglia-lgpl / Test / package, streaming / Test / package, > streaming-kinesis-asl / Test / package, docker-integration-tests / Test / > package, kubernetes / Test / package, launcher / Test / package, > streaming-kinesis-asl-assembly / Test / package, tags / Test / package, > assembly / Test / package, mllib-local / Test / package, > token-provider-kafka-0-10 / Test / package, repl / Test / package, graphx / > Test / package, sql-kafka-0-10 / Test / package, mesos / Test / package, > streaming-kafka-0-10 / Test / package, streaming-kafka-0-10-assembly / Test / > package, examples / Test / package, tools / Test / package, avro / Test / > package, hadoop-cloud / Test / package, mllib / Test / package, kvstore / > Test / package, hive / Test / package, catalyst / Test / package, > network-common / Test / package, unsafe / Test / package, Test / package > ./Run Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run > tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is > deprecated; use slash syntax instead: network-yarn / Test / package, > network-shuffle / Test / package, sketch / Test / package, yarn / Test / > package, sql / Test
[jira] [Commented] (SPARK-39906) Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash syntax instead'
[ https://issues.apache.org/jira/browse/SPARK-39906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598847#comment-17598847 ] Apache Spark commented on SPARK-39906: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/37754 > Eliminate build warnings - 'sbt 0.13 shell syntax is deprecated; use slash > syntax instead' > -- > > Key: SPARK-39906 > URL: https://issues.apache.org/jira/browse/SPARK-39906 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Fix For: 3.4.0 > > > 1. > {code:java} > ./2_Run Build modules catalyst, > hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell > syntax is deprecated; use slash syntax instead: examples / Test / package, > avro / Test / package, network-shuffle / Test / package, sketch / Test / > package, unsafe / Test / package, launcher / Test / package, network-yarn / > Test / package, streaming / Test / package, catalyst / Test / package, > hive-thriftserver / Test / package, kvstore / Test / package, core / Test / > package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, > streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, > network-common / Test / package, sql / Test / package, > streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, > streaming-kinesis-asl / Test / package, docker-integration-tests / Test / > package, kubernetes / Test / package, yarn / Test / package, tags / Test / > package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, > mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / > package, tools / Test / package, mllib-local / Test / package, repl / Test / > package, sql-kafka-0-10 / Test / package, Test / package > ./2_Run Build modules catalyst, > hive-thriftserver.txt:2022-07-27T01:23:12.1294533Z [warn] sbt 0.13 shell > syntax is deprecated; use slash syntax instead: examples / Test / package, > avro / Test / package, network-shuffle / Test / package, sketch / Test / > package, unsafe / Test / package, launcher / Test / package, network-yarn / > Test / package, streaming / Test / package, catalyst / Test / package, > hive-thriftserver / Test / package, kvstore / Test / package, core / Test / > package, ganglia-lgpl / Test / package, hadoop-cloud / Test / package, > streaming-kinesis-asl-assembly / Test / package, assembly / Test / package, > network-common / Test / package, sql / Test / package, > streaming-kafka-0-10-assembly / Test / package, mllib / Test / package, > streaming-kinesis-asl / Test / package, docker-integration-tests / Test / > package, kubernetes / Test / package, yarn / Test / package, tags / Test / > package, graphx / Test / package, token-provider-kafka-0-10 / Test / package, > mesos / Test / package, streaming-kafka-0-10 / Test / package, hive / Test / > package, tools / Test / package, mllib-local / Test / package, repl / Test / > package, sql-kafka-0-10 / Test / package, Test / package > ./Run Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run > tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is > deprecated; use slash syntax instead: network-yarn / Test / package, > network-shuffle / Test / package, sketch / Test / package, yarn / Test / > package, sql / Test / package, core / Test / package, hive-thriftserver / > Test / package, ganglia-lgpl / Test / package, streaming / Test / package, > streaming-kinesis-asl / Test / package, docker-integration-tests / Test / > package, kubernetes / Test / package, launcher / Test / package, > streaming-kinesis-asl-assembly / Test / package, tags / Test / package, > assembly / Test / package, mllib-local / Test / package, > token-provider-kafka-0-10 / Test / package, repl / Test / package, graphx / > Test / package, sql-kafka-0-10 / Test / package, mesos / Test / package, > streaming-kafka-0-10 / Test / package, streaming-kafka-0-10-assembly / Test / > package, examples / Test / package, tools / Test / package, avro / Test / > package, hadoop-cloud / Test / package, mllib / Test / package, kvstore / > Test / package, hive / Test / package, catalyst / Test / package, > network-common / Test / package, unsafe / Test / package, Test / package > ./Run Build modules pyspark-core, pyspark-streaming, pyspark-ml/11_Run > tests.txt:2022-07-27T01:25:30.0840251Z [warn] sbt 0.13 shell syntax is > deprecated; use slash syntax instead: network-yarn / Test / package, > network-shuffle / Test / package, sketch / Test / package, yarn / Test / > package, sql / Test
[jira] [Updated] (SPARK-40304) Add decomTestTag to K8s Integration Test
[ https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40304: -- Priority: Minor (was: Major) > Add decomTestTag to K8s Integration Test > > > Key: SPARK-40304 > URL: https://issues.apache.org/jira/browse/SPARK-40304 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40304) Add decomTestTag to K8s Integration Test
Dongjoon Hyun created SPARK-40304: - Summary: Add decomTestTag to K8s Integration Test Key: SPARK-40304 URL: https://issues.apache.org/jira/browse/SPARK-40304 Project: Spark Issue Type: Test Components: Kubernetes, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40304) Add decomTestTag to K8s Integration Test
[ https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598866#comment-17598866 ] Apache Spark commented on SPARK-40304: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/37755 > Add decomTestTag to K8s Integration Test > > > Key: SPARK-40304 > URL: https://issues.apache.org/jira/browse/SPARK-40304 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40304) Add decomTestTag to K8s Integration Test
[ https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40304: Assignee: (was: Apache Spark) > Add decomTestTag to K8s Integration Test > > > Key: SPARK-40304 > URL: https://issues.apache.org/jira/browse/SPARK-40304 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40304) Add decomTestTag to K8s Integration Test
[ https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40304: Assignee: Apache Spark > Add decomTestTag to K8s Integration Test > > > Key: SPARK-40304 > URL: https://issues.apache.org/jira/browse/SPARK-40304 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40304) Add decomTestTag to K8s Integration Test
[ https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598867#comment-17598867 ] Apache Spark commented on SPARK-40304: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/37755 > Add decomTestTag to K8s Integration Test > > > Key: SPARK-40304 > URL: https://issues.apache.org/jira/browse/SPARK-40304 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40305) Implement Groupby.sem
Ruifeng Zheng created SPARK-40305: - Summary: Implement Groupby.sem Key: SPARK-40305 URL: https://issues.apache.org/jira/browse/SPARK-40305 Project: Spark Issue Type: Improvement Components: ps Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40305) Implement Groupby.sem
[ https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598905#comment-17598905 ] Apache Spark commented on SPARK-40305: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37756 > Implement Groupby.sem > - > > Key: SPARK-40305 > URL: https://issues.apache.org/jira/browse/SPARK-40305 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40305) Implement Groupby.sem
[ https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40305: Assignee: (was: Apache Spark) > Implement Groupby.sem > - > > Key: SPARK-40305 > URL: https://issues.apache.org/jira/browse/SPARK-40305 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40305) Implement Groupby.sem
[ https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598906#comment-17598906 ] Apache Spark commented on SPARK-40305: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37756 > Implement Groupby.sem > - > > Key: SPARK-40305 > URL: https://issues.apache.org/jira/browse/SPARK-40305 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40305) Implement Groupby.sem
[ https://issues.apache.org/jira/browse/SPARK-40305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40305: Assignee: Apache Spark > Implement Groupby.sem > - > > Key: SPARK-40305 > URL: https://issues.apache.org/jira/browse/SPARK-40305 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40149: Assignee: Apache Spark > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Assignee: Apache Spark >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598938#comment-17598938 ] Apache Spark commented on SPARK-40149: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37758 > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598939#comment-17598939 ] Apache Spark commented on SPARK-40149: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37758 > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40149: Assignee: (was: Apache Spark) > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Priority: Blocker > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40279) Document spark.yarn.report.interval
[ https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40279: Assignee: Luca Canali > Document spark.yarn.report.interval > --- > > Key: SPARK-40279 > URL: https://issues.apache.org/jira/browse/SPARK-40279 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > > This proposes to document the configuration paramter > spark.yarn.report.interval -> Interval between reports of the current Spark > job status in cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40279) Document spark.yarn.report.interval
[ https://issues.apache.org/jira/browse/SPARK-40279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40279. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37731 [https://github.com/apache/spark/pull/37731] > Document spark.yarn.report.interval > --- > > Key: SPARK-40279 > URL: https://issues.apache.org/jira/browse/SPARK-40279 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.4.0 > > > This proposes to document the configuration paramter > spark.yarn.report.interval -> Interval between reports of the current Spark > job status in cluster mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599013#comment-17599013 ] Lukas Waldmann commented on SPARK-36862: I manage reproduce the issue in my environment. Problem is on line 192 - variable name in function header having array index Here is the generated code {code:java} /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage636(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=636 /* 006 */ final class GeneratedIteratorForCodegenStage636 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator smj_leftInput_0; /* 010 */ private scala.collection.Iterator smj_rightInput_0; /* 011 */ private InternalRow smj_leftRow_0; /* 012 */ private InternalRow smj_rightRow_0; /* 013 */ private boolean smj_globalIsNull_0; /* 014 */ private boolean smj_globalIsNull_1; /* 015 */ private double smj_value_27; /* 016 */ private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0; /* 017 */ private double smj_value_28; /* 018 */ private boolean smj_isNull_25; /* 019 */ private boolean smj_isNull_26; /* 020 */ private boolean smj_isNull_27; /* 021 */ private boolean smj_isNull_28; /* 022 */ private boolean smj_isNull_29; /* 023 */ private boolean smj_isNull_30; /* 024 */ private boolean project_subExprIsNull_0; /* 025 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_2 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; /* 026 */ private java.util.regex.Pattern[] project_mutableStateArray_0 = new java.util.regex.Pattern[1]; /* 027 */ private Decimal[] smj_mutableStateArray_1 = new Decimal[1]; /* 028 */ private String[] project_mutableStateArray_1 = new String[1]; /* 029 */ private UTF8String[] smj_mutableStateArray_0 = new UTF8String[7]; /* 030 */ /* 031 */ public GeneratedIteratorForCodegenStage636(Object[] references) { /* 032 */ this.references = references; /* 033 */ } /* 034 */ /* 035 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 036 */ partitionIndex = index; /* 037 */ this.inputs = inputs; /* 038 */ smj_leftInput_0 = inputs[0]; /* 039 */ smj_rightInput_0 = inputs[1]; /* 040 */ /* 041 */ smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 2147483647); /* 042 */ smj_mutableStateArray_2[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192); /* 043 */ smj_mutableStateArray_2[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192); /* 044 */ /* 045 */ } /* 046 */ /* 047 */ private boolean smj_findNextOuterJoinRows_0( /* 048 */ scala.collection.Iterator leftIter, /* 049 */ scala.collection.Iterator rightIter) { /* 050 */ smj_leftRow_0 = null; /* 051 */ int comp = 0; /* 052 */ while (smj_leftRow_0 == null) { /* 053 */ if (!leftIter.hasNext()) return false; /* 054 */ smj_leftRow_0 = (InternalRow) leftIter.next(); /* 055 */ UTF8String smj_value_22 = smj_If_0(smj_leftRow_0); /* 056 */ boolean smj_isNull_2 = smj_globalIsNull_1; /* 057 */ double smj_value_2 = -1.0; /* 058 */ if (!smj_globalIsNull_1) { /* 059 */ final String smj_doubleStr_0 = smj_value_22.toString(); /* 060 */ try { /* 061 */ smj_value_2 = Double.valueOf(smj_doubleStr_0); /* 062 */ } catch (java.lang.NumberFormatException e) { /* 063 */ final Double d = (Double) Cast.processFloatingPointSpecialLiterals(smj_doubleStr_0, false); /* 064 */ if (d == null) { /* 065 */ smj_isNull_2 = true; /* 066 */ } else { /* 067 */ smj_value_2 = d.doubleValue(); /* 068 */ } /* 069 */ } /* 070 */ } /* 071 */ boolean smj_isNull_1 = smj_isNull_2; /* 072 */ double smj_value_1 = -1.0; /* 073 */ /* 074 */ if (!smj_isNull_2) { /* 075 */ if (Double.isNaN(smj_value_2)) { /* 076 */ smj_value_1 = Double.NaN; /* 077 */ } else if (smj_value_2 == -0.0d) { /* 078 */ smj_value_1 = 0.0d; /* 079 */ } else { /* 080 */ smj_value_1 = smj_value_2; /* 081 */ } /* 082 */ /* 083 */ } /* 084 */ if (smj_isNull_1) { /* 085 */ if (!smj_matches_0.isEmpty()) { /* 086 */ smj_matches_0.clear(); /* 087 */ } /* 088 */ return true; /* 089 */ } /* 090 */ if (!smj_matches_0.isEmpty()) { /* 091 */ comp = 0; /* 092 */ if (comp == 0) { /* 093 */ comp = org.apache.spark.sql.catalyst.util.SQLOrderingUtil.compareDoubles(smj_value_1, smj_value_28); /* 094 */ } /* 095 */ /* 096 */ if (comp == 0) { /* 097 */ return true; /* 098 */ } /* 099 */ smj_matches_0.clear(); /* 100 */ } /* 101 */ /* 102 */ do { /* 103 */ if (smj_rightRow_0 == null) { /* 104 */ if (!rightIter.hasNext()) { /* 105 */ if (!smj_matches_0.isEmpty()) { /* 106 */ smj_value_28 = smj_value_1; /* 107 */ } /* 108 */ return true; /* 109 */ } /* 110 */ smj_rightRow_0 = (InternalRow)
[jira] [Comment Edited] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599013#comment-17599013 ] Lukas Waldmann edited comment on SPARK-36862 at 9/1/22 2:55 PM: I managed to reproduce the issue in my environment. Problem is on line 192 - variable name in function header having array index Here is the generated code {code:java} /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIteratorForCodegenStage636(references); /* 003 */ } /* 004 */ /* 005 */ // codegenStageId=636 /* 006 */ final class GeneratedIteratorForCodegenStage636 extends org.apache.spark.sql.execution.BufferedRowIterator { /* 007 */ private Object[] references; /* 008 */ private scala.collection.Iterator[] inputs; /* 009 */ private scala.collection.Iterator smj_leftInput_0; /* 010 */ private scala.collection.Iterator smj_rightInput_0; /* 011 */ private InternalRow smj_leftRow_0; /* 012 */ private InternalRow smj_rightRow_0; /* 013 */ private boolean smj_globalIsNull_0; /* 014 */ private boolean smj_globalIsNull_1; /* 015 */ private double smj_value_27; /* 016 */ private org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray smj_matches_0; /* 017 */ private double smj_value_28; /* 018 */ private boolean smj_isNull_25; /* 019 */ private boolean smj_isNull_26; /* 020 */ private boolean smj_isNull_27; /* 021 */ private boolean smj_isNull_28; /* 022 */ private boolean smj_isNull_29; /* 023 */ private boolean smj_isNull_30; /* 024 */ private boolean project_subExprIsNull_0; /* 025 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] smj_mutableStateArray_2 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2]; /* 026 */ private java.util.regex.Pattern[] project_mutableStateArray_0 = new java.util.regex.Pattern[1]; /* 027 */ private Decimal[] smj_mutableStateArray_1 = new Decimal[1]; /* 028 */ private String[] project_mutableStateArray_1 = new String[1]; /* 029 */ private UTF8String[] smj_mutableStateArray_0 = new UTF8String[7]; /* 030 */ /* 031 */ public GeneratedIteratorForCodegenStage636(Object[] references) { /* 032 */ this.references = references; /* 033 */ } /* 034 */ /* 035 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 036 */ partitionIndex = index; /* 037 */ this.inputs = inputs; /* 038 */ smj_leftInput_0 = inputs[0]; /* 039 */ smj_rightInput_0 = inputs[1]; /* 040 */ /* 041 */ smj_matches_0 = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483632, 2147483647); /* 042 */ smj_mutableStateArray_2[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192); /* 043 */ smj_mutableStateArray_2[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(6, 192); /* 044 */ /* 045 */ } /* 046 */ /* 047 */ private boolean smj_findNextOuterJoinRows_0( /* 048 */ scala.collection.Iterator leftIter, /* 049 */ scala.collection.Iterator rightIter) { /* 050 */ smj_leftRow_0 = null; /* 051 */ int comp = 0; /* 052 */ while (smj_leftRow_0 == null) { /* 053 */ if (!leftIter.hasNext()) return false; /* 054 */ smj_leftRow_0 = (InternalRow) leftIter.next(); /* 055 */ UTF8String smj_value_22 = smj_If_0(smj_leftRow_0); /* 056 */ boolean smj_isNull_2 = smj_globalIsNull_1; /* 057 */ double smj_value_2 = -1.0; /* 058 */ if (!smj_globalIsNull_1) { /* 059 */ final String smj_doubleStr_0 = smj_value_22.toString(); /* 060 */ try { /* 061 */ smj_value_2 = Double.valueOf(smj_doubleStr_0); /* 062 */ } catch (java.lang.NumberFormatException e) { /* 063 */ final Double d = (Double) Cast.processFloatingPointSpecialLiterals(smj_doubleStr_0, false); /* 064 */ if (d == null) { /* 065 */ smj_isNull_2 = true; /* 066 */ } else { /* 067 */ smj_value_2 = d.doubleValue(); /* 068 */ } /* 069 */ } /* 070 */ } /* 071 */ boolean smj_isNull_1 = smj_isNull_2; /* 072 */ double smj_value_1 = -1.0; /* 073 */ /* 074 */ if (!smj_isNull_2) { /* 075 */ if (Double.isNaN(smj_value_2)) { /* 076 */ smj_value_1 = Double.NaN; /* 077 */ } else if (smj_value_2 == -0.0d) { /* 078 */ smj_value_1 = 0.0d; /* 079 */ } else { /* 080 */ smj_value_1 = smj_value_2; /* 081 */ } /* 082 */ /* 083 */ } /* 084 */ if (smj_isNull_1) { /* 085 */ if (!smj_matches_0.isEmpty()) { /* 086 */ smj_matches_0.clear(); /* 087 */ } /* 088 */ return true; /* 089 */ } /* 090 */ if (!smj_matches_0.isEmpty()) { /* 091 */ comp = 0; /* 092 */ if (comp == 0) { /* 093 */ comp = org.apache.spark.sql.catalyst.util.SQLOrderingUtil.compareDoubles(smj_value_1, smj_value_28); /* 094 */ } /* 095 */ /* 096 */ if (comp == 0) { /* 097 */ return true; /* 098 */ } /* 099 */ smj_matches_0.clear(); /* 100 */ } /* 101 */ /* 102 */ do { /* 103 */ if (smj_rightRow_0 == null) { /* 104 */ if (!rightIter.hasNext()) { /* 105 */ if (!smj_matches_0.isEmpty()) { /* 106 */ smj_value_28 = smj_value_1; /* 107 */ } /* 108 */ return true; /*
[jira] [Created] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
Wan Kun created SPARK-40306: --- Summary: Support more than Integer.MAX_VALUE of the same join key Key: SPARK-40306 URL: https://issues.apache.org/jira/browse/SPARK-40306 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Wan Kun For SMJ, the number of the same join key records of the right table is greater than Integer.MAX_VALUE, the result will be incorrect. Before SMJ JOIN, we will put the records of the same join key into the ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex overflow may cause incorrect result. For example, one of our production table: !image-2022-09-01-23-01-46-282.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
[ https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-40306: Description: For SMJ, the number of the same join key records of the right table is greater than Integer.MAX_VALUE, the result will be incorrect. Before SMJ JOIN, we will put the records of the same join key into the ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex overflow may cause incorrect result. For example, one of our production table: !image-2022-09-01-23-02-15-955.png! was: For SMJ, the number of the same join key records of the right table is greater than Integer.MAX_VALUE, the result will be incorrect. Before SMJ JOIN, we will put the records of the same join key into the ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex overflow may cause incorrect result. For example, one of our production table: !image-2022-09-01-23-01-46-282.png! > Support more than Integer.MAX_VALUE of the same join key > > > Key: SPARK-40306 > URL: https://issues.apache.org/jira/browse/SPARK-40306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > Attachments: image-2022-09-01-23-02-15-955.png > > > For SMJ, the number of the same join key records of the right table is > greater than Integer.MAX_VALUE, the result will be incorrect. > Before SMJ JOIN, we will put the records of the same join key into the > ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows > overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex > overflow may cause incorrect result. > For example, one of our production table: > !image-2022-09-01-23-02-15-955.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
[ https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wan Kun updated SPARK-40306: Attachment: image-2022-09-01-23-02-15-955.png > Support more than Integer.MAX_VALUE of the same join key > > > Key: SPARK-40306 > URL: https://issues.apache.org/jira/browse/SPARK-40306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > Attachments: image-2022-09-01-23-02-15-955.png > > > For SMJ, the number of the same join key records of the right table is > greater than Integer.MAX_VALUE, the result will be incorrect. > Before SMJ JOIN, we will put the records of the same join key into the > ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows > overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex > overflow may cause incorrect result. > For example, one of our production table: > !image-2022-09-01-23-01-46-282.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
[ https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40306: Assignee: Apache Spark > Support more than Integer.MAX_VALUE of the same join key > > > Key: SPARK-40306 > URL: https://issues.apache.org/jira/browse/SPARK-40306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Major > Attachments: image-2022-09-01-23-02-15-955.png > > > For SMJ, the number of the same join key records of the right table is > greater than Integer.MAX_VALUE, the result will be incorrect. > Before SMJ JOIN, we will put the records of the same join key into the > ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows > overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex > overflow may cause incorrect result. > For example, one of our production table: > !image-2022-09-01-23-02-15-955.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
[ https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599023#comment-17599023 ] Apache Spark commented on SPARK-40306: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37759 > Support more than Integer.MAX_VALUE of the same join key > > > Key: SPARK-40306 > URL: https://issues.apache.org/jira/browse/SPARK-40306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > Attachments: image-2022-09-01-23-02-15-955.png > > > For SMJ, the number of the same join key records of the right table is > greater than Integer.MAX_VALUE, the result will be incorrect. > Before SMJ JOIN, we will put the records of the same join key into the > ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows > overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex > overflow may cause incorrect result. > For example, one of our production table: > !image-2022-09-01-23-02-15-955.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
[ https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599021#comment-17599021 ] Apache Spark commented on SPARK-40306: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37759 > Support more than Integer.MAX_VALUE of the same join key > > > Key: SPARK-40306 > URL: https://issues.apache.org/jira/browse/SPARK-40306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > Attachments: image-2022-09-01-23-02-15-955.png > > > For SMJ, the number of the same join key records of the right table is > greater than Integer.MAX_VALUE, the result will be incorrect. > Before SMJ JOIN, we will put the records of the same join key into the > ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows > overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex > overflow may cause incorrect result. > For example, one of our production table: > !image-2022-09-01-23-02-15-955.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40306) Support more than Integer.MAX_VALUE of the same join key
[ https://issues.apache.org/jira/browse/SPARK-40306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40306: Assignee: (was: Apache Spark) > Support more than Integer.MAX_VALUE of the same join key > > > Key: SPARK-40306 > URL: https://issues.apache.org/jira/browse/SPARK-40306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Major > Attachments: image-2022-09-01-23-02-15-955.png > > > For SMJ, the number of the same join key records of the right table is > greater than Integer.MAX_VALUE, the result will be incorrect. > Before SMJ JOIN, we will put the records of the same join key into the > ExternalAppendOnlyUnsafeRowArray. ExternalAppendOnlyUnsafeRowArray.numRows > overflow may cause OOM. During SMJ JOIN, SpillableArrayIterator.startIndex > overflow may cause incorrect result. > For example, one of our production table: > !image-2022-09-01-23-02-15-955.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38404) Spark does not find CTE inside nested CTE
[ https://issues.apache.org/jira/browse/SPARK-38404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599028#comment-17599028 ] Apache Spark commented on SPARK-38404: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37760 > Spark does not find CTE inside nested CTE > - > > Key: SPARK-38404 > URL: https://issues.apache.org/jira/browse/SPARK-38404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > * MacOS Monterrey 12.2.1 (21D62) > * python 3.9.10 > * pip 22.0.3 > * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and > 3.1.3 (SQL query works) >Reporter: Joan Heredia Rius >Assignee: Peter Toth >Priority: Minor > Fix For: 3.4.0 > > > Hello! > Seems that when defining CTEs and using them inside another CTE in Spark SQL, > Spark thinks the inner call for the CTE is a table or view, which is not > found and then it errors with `Table or view not found: ` > h3. Steps to reproduce > # `pip install pyspark==3.2.0` (also happens with 3.2.1) > # start pyspark console by typing `pyspark` in the terminal > # Try to run the following SQL with `spark.sql(sql)` > > {code:java} > WITH mock_cte__usersAS ( >SELECT 1 AS id >), >model_under_test AS ( > WITH usersAS ( > SELECT * > FROM mock_cte__users > ) >SELECT * > FROM users >) > SELECT * > FROM model_under_test;{code} > Spark will fail with > > {code:java} > pyspark.sql.utils.AnalysisException: Table or view not found: > mock_cte__users; line 8 pos 29; {code} > I don't know if this is a regression or an expected behavior of the new 3.2.* > versions. This fix introduced in 3.2.0 might be related: > https://issues.apache.org/jira/browse/SPARK-36447 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38404) Spark does not find CTE inside nested CTE
[ https://issues.apache.org/jira/browse/SPARK-38404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599027#comment-17599027 ] Apache Spark commented on SPARK-38404: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/37760 > Spark does not find CTE inside nested CTE > - > > Key: SPARK-38404 > URL: https://issues.apache.org/jira/browse/SPARK-38404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > * MacOS Monterrey 12.2.1 (21D62) > * python 3.9.10 > * pip 22.0.3 > * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and > 3.1.3 (SQL query works) >Reporter: Joan Heredia Rius >Assignee: Peter Toth >Priority: Minor > Fix For: 3.4.0 > > > Hello! > Seems that when defining CTEs and using them inside another CTE in Spark SQL, > Spark thinks the inner call for the CTE is a table or view, which is not > found and then it errors with `Table or view not found: ` > h3. Steps to reproduce > # `pip install pyspark==3.2.0` (also happens with 3.2.1) > # start pyspark console by typing `pyspark` in the terminal > # Try to run the following SQL with `spark.sql(sql)` > > {code:java} > WITH mock_cte__usersAS ( >SELECT 1 AS id >), >model_under_test AS ( > WITH usersAS ( > SELECT * > FROM mock_cte__users > ) >SELECT * > FROM users >) > SELECT * > FROM model_under_test;{code} > Spark will fail with > > {code:java} > pyspark.sql.utils.AnalysisException: Table or view not found: > mock_cte__users; line 8 pos 29; {code} > I don't know if this is a regression or an expected behavior of the new 3.2.* > versions. This fix introduced in 3.2.0 might be related: > https://issues.apache.org/jira/browse/SPARK-36447 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40302) Add YuniKornSuite
[ https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40302: - Assignee: Dongjoon Hyun > Add YuniKornSuite > - > > Key: SPARK-40302 > URL: https://issues.apache.org/jira/browse/SPARK-40302 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0, 3.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40302) Add YuniKornSuite
[ https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40302. --- Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37753 [https://github.com/apache/spark/pull/37753] > Add YuniKornSuite > - > > Key: SPARK-40302 > URL: https://issues.apache.org/jira/browse/SPARK-40302 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0, 3.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.1, 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40304) Add decomTestTag to K8s Integration Test
[ https://issues.apache.org/jira/browse/SPARK-40304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40304. --- Fix Version/s: 3.4.0 3.3.1 Assignee: Dongjoon Hyun Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/37755 > Add decomTestTag to K8s Integration Test > > > Key: SPARK-40304 > URL: https://issues.apache.org/jira/browse/SPARK-40304 > Project: Spark > Issue Type: Test > Components: Kubernetes, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40302) Add YuniKornSuite
[ https://issues.apache.org/jira/browse/SPARK-40302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40302: -- Parent: SPARK-36057 Issue Type: Sub-task (was: Test) > Add YuniKornSuite > - > > Key: SPARK-40302 > URL: https://issues.apache.org/jira/browse/SPARK-40302 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.4.0, 3.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0, 3.3.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-39996: Summary: Upgrade postgresql to 42.5.0 (was: Upgrade postgresql to 42.4.1) > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39996: Assignee: Apache Spark > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599103#comment-17599103 ] Apache Spark commented on SPARK-39996: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/37762 > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39996) Upgrade postgresql to 42.5.0
[ https://issues.apache.org/jira/browse/SPARK-39996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39996: Assignee: (was: Apache Spark) > Upgrade postgresql to 42.5.0 > > > Key: SPARK-39996 > URL: https://issues.apache.org/jira/browse/SPARK-39996 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Security > - fix: > [CVE-2022-31197|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-31197] > Fixes SQL generated in PgResultSet.refresh() to escape column identifiers so > as to prevent SQL injection. > - Previously, the column names for both key and data columns in the table > were copied as-is into the generated > SQL. This allowed a malicious table with column names that include > statement terminator to be parsed and > executed as multiple separate commands. > - Also adds a new test class ResultSetRefreshTest to verify this change. > - Reported by [Sho Kato](https://github.com/kato-sho) > [Release > note|https://github.com/pgjdbc/pgjdbc/commit/bd91c4cc76cdfc1ffd0322be80c85ddfe08a38c2] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37809) Add yunikorn feature step
[ https://issues.apache.org/jira/browse/SPARK-37809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang resolved SPARK-37809. - Resolution: Won't Do Based on the feedback from the community, there is no need to add the extra feature step. YuniKorn is able to work with the current version of Spark without any code changes. Please see doc https://issues.apache.org/jira/browse/SPARK-40187. > Add yunikorn feature step > - > > Key: SPARK-37809 > URL: https://issues.apache.org/jira/browse/SPARK-37809 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Weiwei Yang >Priority: Major > > Based on the design doc > https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg. > This task is to add YuniKornFeatureStep in order to integrate with YuniKorn. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38310) Support job queue in YuniKorn feature step
[ https://issues.apache.org/jira/browse/SPARK-38310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang resolved SPARK-38310. - Resolution: Won't Do YuniKorn will follow the standard Spark API for the integration, there is no need to add code changes to support YuniKorn queues. > Support job queue in YuniKorn feature step > -- > > Key: SPARK-38310 > URL: https://issues.apache.org/jira/browse/SPARK-38310 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.2 > Environment: Like SPARK-38188, yunikorn needs to support the queue > property and use that in the feature step to properly configure > driver/executor pods. >Reporter: Weiwei Yang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40307) Optimize (De)Serialization of Python UDF
Xinrong Meng created SPARK-40307: Summary: Optimize (De)Serialization of Python UDF Key: SPARK-40307 URL: https://issues.apache.org/jira/browse/SPARK-40307 Project: Spark Issue Type: Umbrella Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng Python user-defined function (UDF) enables users to run arbitrary code against PySpark columns. It uses Pickle for (de)serialization, and executes row by row. One major performance bottleneck of Python UDFs is (de)serialization, that is, the data interchanging between the worker JVM and the spawned Python subprocess which actually executes the UDF. We should seek for an alternative to handle the (de)serialization: Arrow, which is used in (de)serialization of Pandas UDF already. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599160#comment-17599160 ] Dongjoon Hyun edited comment on SPARK-33605 at 9/1/22 9:33 PM: --- I made a PR and had a discussion, but this issue is closed based on the following. bq. note that the gcs connector (at leasts the builds off their master) are java 11 only; not sure where that stands w.r.t older releases Apache Spark community has no plan to drop Java 8 yet. was (Author: dongjoon): I made a PR and had a discussion, but this issue is closed based on the following. > note that the gcs connector (at leasts the builds off their master) are java > 11 only; not sure where that stands w.r.t older releases Apache Spark community has no plan to drop Java 8 yet. > Add GCS FS/connector config (dependencies?) akin to S3 > -- > > Key: SPARK-33605 > URL: https://issues.apache.org/jira/browse/SPARK-33605 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.1 >Reporter: Rafal Wojdyla >Priority: Major > > Spark comes with some S3 batteries included, which makes it easier to use > with S3, for GCS to work users are required to manually configure the jars. > This is especially problematic for python users who may not be accustomed to > java dependencies etc. This is an example of workaround for pyspark: > [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the > [GCS > connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], > it would make things easier for GCS users. > Please let me know what you think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33605. --- Resolution: Won't Do I made a PR and had a discussion, but this issue is closed based on the following. > note that the gcs connector (at leasts the builds off their master) are java > 11 only; not sure where that stands w.r.t older releases Apache Spark community has no plan to drop Java 8 yet. > Add GCS FS/connector config (dependencies?) akin to S3 > -- > > Key: SPARK-33605 > URL: https://issues.apache.org/jira/browse/SPARK-33605 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.1 >Reporter: Rafal Wojdyla >Priority: Major > > Spark comes with some S3 batteries included, which makes it easier to use > with S3, for GCS to work users are required to manually configure the jars. > This is especially problematic for python users who may not be accustomed to > java dependencies etc. This is an example of workaround for pyspark: > [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the > [GCS > connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], > it would make things easier for GCS users. > Please let me know what you think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-33605: --- > Add GCS FS/connector config (dependencies?) akin to S3 > -- > > Key: SPARK-33605 > URL: https://issues.apache.org/jira/browse/SPARK-33605 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.1 >Reporter: Rafal Wojdyla >Priority: Major > > Spark comes with some S3 batteries included, which makes it easier to use > with S3, for GCS to work users are required to manually configure the jars. > This is especially problematic for python users who may not be accustomed to > java dependencies etc. This is an example of workaround for pyspark: > [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the > [GCS > connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], > it would make things easier for GCS users. > Please let me know what you think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33605: Assignee: (was: Apache Spark) > Add GCS FS/connector config (dependencies?) akin to S3 > -- > > Key: SPARK-33605 > URL: https://issues.apache.org/jira/browse/SPARK-33605 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.1 >Reporter: Rafal Wojdyla >Priority: Major > > Spark comes with some S3 batteries included, which makes it easier to use > with S3, for GCS to work users are required to manually configure the jars. > This is especially problematic for python users who may not be accustomed to > java dependencies etc. This is an example of workaround for pyspark: > [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the > [GCS > connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], > it would make things easier for GCS users. > Please let me know what you think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33605: Assignee: Apache Spark > Add GCS FS/connector config (dependencies?) akin to S3 > -- > > Key: SPARK-33605 > URL: https://issues.apache.org/jira/browse/SPARK-33605 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.1 >Reporter: Rafal Wojdyla >Assignee: Apache Spark >Priority: Major > > Spark comes with some S3 batteries included, which makes it easier to use > with S3, for GCS to work users are required to manually configure the jars. > This is especially problematic for python users who may not be accustomed to > java dependencies etc. This is an example of workaround for pyspark: > [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the > [GCS > connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], > it would make things easier for GCS users. > Please let me know what you think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33605) Add GCS FS/connector config (dependencies?) akin to S3
[ https://issues.apache.org/jira/browse/SPARK-33605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599162#comment-17599162 ] Dongjoon Hyun commented on SPARK-33605: --- {{My bad. It was Java 8.}} {{- https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/8453ce7ce7510e983bae7470909fbd02704c0539/pom.xml#L76-L77}} {quote}{{8}} {{8}} {quote} > Add GCS FS/connector config (dependencies?) akin to S3 > -- > > Key: SPARK-33605 > URL: https://issues.apache.org/jira/browse/SPARK-33605 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.1 >Reporter: Rafal Wojdyla >Priority: Major > > Spark comes with some S3 batteries included, which makes it easier to use > with S3, for GCS to work users are required to manually configure the jars. > This is especially problematic for python users who may not be accustomed to > java dependencies etc. This is an example of workaround for pyspark: > [pyspark_gcs|https://github.com/ravwojdyla/pyspark_gcs]. If we include the > [GCS > connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage], > it would make things easier for GCS users. > Please let me know what you think. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40308) str_to_map should accept non-foldable delimiter parameters
Bruce Robbins created SPARK-40308: - Summary: str_to_map should accept non-foldable delimiter parameters Key: SPARK-40308 URL: https://issues.apache.org/jira/browse/SPARK-40308 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Bruce Robbins Currently, str_to_map requires the delimiter parameters to be foldable expressions. For example, the following doesn't work in Spark SQL: {noformat} drop table if exists maptbl; create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as str; select str, str_to_map(str, del1, del2) from maptbl; {noformat} You get the following error: {noformat} str_to_map's delimiters must be foldable.; line 1 pos 12; {noformat} However, the above example SQL statements do work in Hive 2.3.9. There, you get: {noformat} +--++ | str |_c1 | +--++ | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | +--++ 2 rows selected (0.13 seconds) {noformat} It's unlikely that an input table would have the needed delimiters in columns. The use-case is more likely to be something like this, where the delimiters are determined based on some other value: {noformat} select str, str_to_map(str, ',', if(region = 0, ':', '#')) as m from maptbl2; {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments
[ https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40308: -- Summary: str_to_map should accept non-foldable delimiter arguments (was: str_to_map should accept non-foldable delimiter parameters) > str_to_map should accept non-foldable delimiter arguments > - > > Key: SPARK-40308 > URL: https://issues.apache.org/jira/browse/SPARK-40308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Priority: Minor > > Currently, str_to_map requires the delimiter parameters to be foldable > expressions. For example, the following doesn't work in Spark SQL: > {noformat} > drop table if exists maptbl; > create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; > insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as > str; > select str, str_to_map(str, del1, del2) from maptbl; > {noformat} > You get the following error: > {noformat} > str_to_map's delimiters must be foldable.; line 1 pos 12; > {noformat} > However, the above example SQL statements do work in Hive 2.3.9. There, you > get: > {noformat} > +--++ > | str |_c1 | > +--++ > | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | > | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | > +--++ > 2 rows selected (0.13 seconds) > {noformat} > It's unlikely that an input table would have the needed delimiters in > columns. The use-case is more likely to be something like this, where the > delimiters are determined based on some other value: > {noformat} > select > str, > str_to_map(str, ',', if(region = 0, ':', '#')) as m > from > maptbl2; > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments
[ https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-40308: -- Description: Currently, str_to_map requires the delimiter arguments to be foldable expressions. For example, the following doesn't work in Spark SQL: {noformat} drop table if exists maptbl; create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as str; select str, str_to_map(str, del1, del2) from maptbl; {noformat} You get the following error: {noformat} str_to_map's delimiters must be foldable.; line 1 pos 12; {noformat} However, the above example SQL statements do work in Hive 2.3.9. There, you get: {noformat} +--++ | str |_c1 | +--++ | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | +--++ 2 rows selected (0.13 seconds) {noformat} It's unlikely that an input table would have the needed delimiters in columns. The use-case is more likely to be something like this, where the delimiters are determined based on some other value: {noformat} select str, str_to_map(str, ',', if(region = 0, ':', '#')) as m from maptbl2; {noformat} was: Currently, str_to_map requires the delimiter parameters to be foldable expressions. For example, the following doesn't work in Spark SQL: {noformat} drop table if exists maptbl; create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as str; select str, str_to_map(str, del1, del2) from maptbl; {noformat} You get the following error: {noformat} str_to_map's delimiters must be foldable.; line 1 pos 12; {noformat} However, the above example SQL statements do work in Hive 2.3.9. There, you get: {noformat} +--++ | str |_c1 | +--++ | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | +--++ 2 rows selected (0.13 seconds) {noformat} It's unlikely that an input table would have the needed delimiters in columns. The use-case is more likely to be something like this, where the delimiters are determined based on some other value: {noformat} select str, str_to_map(str, ',', if(region = 0, ':', '#')) as m from maptbl2; {noformat} > str_to_map should accept non-foldable delimiter arguments > - > > Key: SPARK-40308 > URL: https://issues.apache.org/jira/browse/SPARK-40308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Priority: Minor > > Currently, str_to_map requires the delimiter arguments to be foldable > expressions. For example, the following doesn't work in Spark SQL: > {noformat} > drop table if exists maptbl; > create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; > insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as > str; > select str, str_to_map(str, del1, del2) from maptbl; > {noformat} > You get the following error: > {noformat} > str_to_map's delimiters must be foldable.; line 1 pos 12; > {noformat} > However, the above example SQL statements do work in Hive 2.3.9. There, you > get: > {noformat} > +--++ > | str |_c1 | > +--++ > | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | > | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | > +--++ > 2 rows selected (0.13 seconds) > {noformat} > It's unlikely that an input table would have the needed delimiters in > columns. The use-case is more likely to be something like this, where the > delimiters are determined based on some other value: > {noformat} > select > str, > str_to_map(str, ',', if(region = 0, ':', '#')) as m > from > maptbl2; > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql
Xinrong Meng created SPARK-40309: Summary: Introduce sql_conf context manager for pyspark.sql Key: SPARK-40309 URL: https://issues.apache.org/jira/browse/SPARK-40309 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490] a context manager is introduced to set the Spark SQL configuration and then restores it back when it exits, in Pandas API on Spark. That simplifies the control of Spark SQL configuration, from {code:java} original_value = spark.conf.get("key") spark.conf.set("key", "value") ... spark.conf.set("key", original_value){code} to {code:java} with sql_conf({"key": "value"}): ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql
[ https://issues.apache.org/jira/browse/SPARK-40309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40309: - Description: [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490] a context manager is introduced to set the Spark SQL configuration and then restores it back when it exits, in Pandas API on Spark. That simplifies the control of Spark SQL configuration, from {code:java} original_value = spark.conf.get("key") spark.conf.set("key", "value") ... spark.conf.set("key", original_value){code} to {code:java} with sql_conf({"key": "value"}): ... {code} was: [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490] a context manager is introduced to set the Spark SQL configuration and then restores it back when it exits, in Pandas API on Spark. That simplifies the control of Spark SQL configuration, from {code:java} original_value = spark.conf.get("key") spark.conf.set("key", "value") ... spark.conf.set("key", original_value){code} to {code:java} with sql_conf({"key": "value"}): ... {code} > Introduce sql_conf context manager for pyspark.sql > -- > > Key: SPARK-40309 > URL: https://issues.apache.org/jira/browse/SPARK-40309 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490] > a context manager is introduced to set the Spark SQL configuration and > then restores it back when it exits, in Pandas API on Spark. > That simplifies the control of Spark SQL configuration, > from > {code:java} > original_value = spark.conf.get("key") > spark.conf.set("key", "value") > ... > spark.conf.set("key", original_value){code} > to > {code:java} > with sql_conf({"key": "value"}): > ... > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments
[ https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599174#comment-17599174 ] Apache Spark commented on SPARK-40308: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/37763 > str_to_map should accept non-foldable delimiter arguments > - > > Key: SPARK-40308 > URL: https://issues.apache.org/jira/browse/SPARK-40308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Priority: Minor > > Currently, str_to_map requires the delimiter arguments to be foldable > expressions. For example, the following doesn't work in Spark SQL: > {noformat} > drop table if exists maptbl; > create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; > insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as > str; > select str, str_to_map(str, del1, del2) from maptbl; > {noformat} > You get the following error: > {noformat} > str_to_map's delimiters must be foldable.; line 1 pos 12; > {noformat} > However, the above example SQL statements do work in Hive 2.3.9. There, you > get: > {noformat} > +--++ > | str |_c1 | > +--++ > | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | > | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | > +--++ > 2 rows selected (0.13 seconds) > {noformat} > It's unlikely that an input table would have the needed delimiters in > columns. The use-case is more likely to be something like this, where the > delimiters are determined based on some other value: > {noformat} > select > str, > str_to_map(str, ',', if(region = 0, ':', '#')) as m > from > maptbl2; > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments
[ https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40308: Assignee: (was: Apache Spark) > str_to_map should accept non-foldable delimiter arguments > - > > Key: SPARK-40308 > URL: https://issues.apache.org/jira/browse/SPARK-40308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Priority: Minor > > Currently, str_to_map requires the delimiter arguments to be foldable > expressions. For example, the following doesn't work in Spark SQL: > {noformat} > drop table if exists maptbl; > create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; > insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as > str; > select str, str_to_map(str, del1, del2) from maptbl; > {noformat} > You get the following error: > {noformat} > str_to_map's delimiters must be foldable.; line 1 pos 12; > {noformat} > However, the above example SQL statements do work in Hive 2.3.9. There, you > get: > {noformat} > +--++ > | str |_c1 | > +--++ > | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | > | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | > +--++ > 2 rows selected (0.13 seconds) > {noformat} > It's unlikely that an input table would have the needed delimiters in > columns. The use-case is more likely to be something like this, where the > delimiters are determined based on some other value: > {noformat} > select > str, > str_to_map(str, ',', if(region = 0, ':', '#')) as m > from > maptbl2; > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40308) str_to_map should accept non-foldable delimiter arguments
[ https://issues.apache.org/jira/browse/SPARK-40308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40308: Assignee: Apache Spark > str_to_map should accept non-foldable delimiter arguments > - > > Key: SPARK-40308 > URL: https://issues.apache.org/jira/browse/SPARK-40308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > Currently, str_to_map requires the delimiter arguments to be foldable > expressions. For example, the following doesn't work in Spark SQL: > {noformat} > drop table if exists maptbl; > create table maptbl as select ',' as del1, ':' as del2, 'a:1,b:2,c:3' as str; > insert into table maptbl select '%' as del1, '-' as del2, 'a-1%b-2%c-3' as > str; > select str, str_to_map(str, del1, del2) from maptbl; > {noformat} > You get the following error: > {noformat} > str_to_map's delimiters must be foldable.; line 1 pos 12; > {noformat} > However, the above example SQL statements do work in Hive 2.3.9. There, you > get: > {noformat} > +--++ > | str |_c1 | > +--++ > | a:1,b:2,c:3 | {"a":"1","b":"2","c":"3"} | > | a-1%b-2%c-3 | {"a":"1","b":"2","c":"3"} | > +--++ > 2 rows selected (0.13 seconds) > {noformat} > It's unlikely that an input table would have the needed delimiters in > columns. The use-case is more likely to be something like this, where the > delimiters are determined based on some other value: > {noformat} > select > str, > str_to_map(str, ',', if(region = 0, ':', '#')) as m > from > maptbl2; > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35161) Error-handling SQL functions
[ https://issues.apache.org/jira/browse/SPARK-35161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-35161: --- Epic Link: SPARK-35030 (was: SPARK-38783) > Error-handling SQL functions > > > Key: SPARK-35161 > URL: https://issues.apache.org/jira/browse/SPARK-35161 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Priority: Major > > Create new Error-handling version SQL functions for existing SQL > functions/operators, which returns NULL if overflow/error occurs. So that: > 1. Users can manage to finish queries without interruptions in ANSI mode. > 2. Users can get NULLs instead of unreasonable results if overflow occurs > when ANSI mode is off. > For example, the behavior of the following SQL operations is unreasonable: > {code:java} > 2147483647 + 2 => -2147483647 > CAST(2147483648L AS INT) => -2147483648 > {code} > With the new safe version SQL functions: > {code:java} > TRY_ADD(2147483647, 2) => null > TRY_CAST(2147483648L AS INT) => null > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40310) try_sum() should throw exceptions from its child
Gengliang Wang created SPARK-40310: -- Summary: try_sum() should throw exceptions from its child Key: SPARK-40310 URL: https://issues.apache.org/jira/browse/SPARK-40310 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang Similar to [https://github.com/apache/spark/pull/37486] and [https://github.com/apache/spark/pull/37663,] the errors from try_sum()'s child should be shown instead of ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40310) try_sum() should throw the exceptions from its child
[ https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-40310: --- Summary: try_sum() should throw the exceptions from its child (was: try_sum() should throw exceptions from its child) > try_sum() should throw the exceptions from its child > > > Key: SPARK-40310 > URL: https://issues.apache.org/jira/browse/SPARK-40310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Similar to [https://github.com/apache/spark/pull/37486] and > [https://github.com/apache/spark/pull/37663,] the errors from try_sum()'s > child should be shown instead of ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40310) try_sum() should throw the exceptions from its child
[ https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599185#comment-17599185 ] Apache Spark commented on SPARK-40310: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37764 > try_sum() should throw the exceptions from its child > > > Key: SPARK-40310 > URL: https://issues.apache.org/jira/browse/SPARK-40310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Similar to [https://github.com/apache/spark/pull/37486] and > [https://github.com/apache/spark/pull/37663,] the errors from try_sum()'s > child should be shown instead of ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40310) try_sum() should throw the exceptions from its child
[ https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40310: Assignee: Gengliang Wang (was: Apache Spark) > try_sum() should throw the exceptions from its child > > > Key: SPARK-40310 > URL: https://issues.apache.org/jira/browse/SPARK-40310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Similar to [https://github.com/apache/spark/pull/37486] and > [https://github.com/apache/spark/pull/37663,] the errors from try_sum()'s > child should be shown instead of ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40310) try_sum() should throw the exceptions from its child
[ https://issues.apache.org/jira/browse/SPARK-40310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40310: Assignee: Apache Spark (was: Gengliang Wang) > try_sum() should throw the exceptions from its child > > > Key: SPARK-40310 > URL: https://issues.apache.org/jira/browse/SPARK-40310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Similar to [https://github.com/apache/spark/pull/37486] and > [https://github.com/apache/spark/pull/37663,] the errors from try_sum()'s > child should be shown instead of ignored. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40311) Introduce withColumnsRenamed
Santosh Pingale created SPARK-40311: --- Summary: Introduce withColumnsRenamed Key: SPARK-40311 URL: https://issues.apache.org/jira/browse/SPARK-40311 Project: Spark Issue Type: Improvement Components: PySpark, SparkR, SQL Affects Versions: 3.2.2, 3.3.0, 3.1.3, 3.0.3 Reporter: Santosh Pingale Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. This is mostly a performance related optimisations where users iteratively perform `withColumnRenamed`. With 100s columns and multiple iterations, there are cases where either driver will blow up or users will receive a StackOverflowError. {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() {col, f"prefix_{col}" for col in raw.columns} for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) g = datetime.datetime.now() g-a datetime.timedelta(microseconds=632116) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-40311: Description: Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. This is mostly a performance related optimisations where users iteratively perform `withColumnRenamed`. With 100s columns and multiple iterations, there are cases where either driver will blow up or users will receive a StackOverflowError. {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() {col, f"prefix_{col}" for col in raw.columns} for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) g = datetime.datetime.now() g-a datetime.timedelta(microseconds=632116) {code} was: Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. This is mostly a performance related optimisations where users iteratively perform `withColumnRenamed`. With 100s columns and multiple iterations, there are cases where either driver will blow up or users will receive a StackOverflowError. {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() {col, f"prefix_{col}" for col in raw.columns} for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsR
[jira] [Assigned] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40311: Assignee: Apache Spark > Introduce withColumnsRenamed > > > Key: SPARK-40311 > URL: https://issues.apache.org/jira/browse/SPARK-40311 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Assignee: Apache Spark >Priority: Minor > > Add a scala, pyspark, R dataframe API that can rename multiple columns in a > single command. This is mostly a performance related optimisations where > users iteratively perform `withColumnRenamed`. With 100s columns and multiple > iterations, there are cases where either driver will blow up or users will > receive a StackOverflowError. > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > {col, f"prefix_{col}" for col in raw.columns} > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > b = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > c = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > d = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > e = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > f = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > g = datetime.datetime.now() > g-a > datetime.timedelta(seconds=12, microseconds=480021) {code} > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > b = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > c = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > d = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > e = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > f = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > g = datetime.datetime.now() > g-a > datetime.timedelta(microseconds=632116) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599191#comment-17599191 ] Apache Spark commented on SPARK-40311: -- User 'santosh-d3vpl3x' has created a pull request for this issue: https://github.com/apache/spark/pull/37761 > Introduce withColumnsRenamed > > > Key: SPARK-40311 > URL: https://issues.apache.org/jira/browse/SPARK-40311 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Priority: Minor > > Add a scala, pyspark, R dataframe API that can rename multiple columns in a > single command. This is mostly a performance related optimisations where > users iteratively perform `withColumnRenamed`. With 100s columns and multiple > iterations, there are cases where either driver will blow up or users will > receive a StackOverflowError. > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > {col, f"prefix_{col}" for col in raw.columns} > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > b = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > c = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > d = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > e = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > f = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > g = datetime.datetime.now() > g-a > datetime.timedelta(seconds=12, microseconds=480021) {code} > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > b = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > c = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > d = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > e = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > f = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > g = datetime.datetime.now() > g-a > datetime.timedelta(microseconds=632116) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-40311: Description: Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. This is mostly a performance related optimisations where users iteratively perform `withColumnRenamed`. With 100s columns and multiple iterations, there are cases where either driver will blow up or users will receive a StackOverflowError. {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) g = datetime.datetime.now() g-a datetime.timedelta(microseconds=632116) {code} was: Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. This is mostly a performance related optimisations where users iteratively perform `withColumnRenamed`. With 100s columns and multiple iterations, there are cases where either driver will blow up or users will receive a StackOverflowError. {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() {col, f"prefix_{col}" for col in raw.columns} for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}
[jira] [Assigned] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40311: Assignee: (was: Apache Spark) > Introduce withColumnsRenamed > > > Key: SPARK-40311 > URL: https://issues.apache.org/jira/browse/SPARK-40311 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Priority: Minor > > Add a scala, pyspark, R dataframe API that can rename multiple columns in a > single command. This is mostly a performance related optimisations where > users iteratively perform `withColumnRenamed`. With 100s columns and multiple > iterations, there are cases where either driver will blow up or users will > receive a StackOverflowError. > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > {col, f"prefix_{col}" for col in raw.columns} > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > b = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > c = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > d = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > e = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > f = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > g = datetime.datetime.now() > g-a > datetime.timedelta(seconds=12, microseconds=480021) {code} > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > b = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > c = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > d = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > e = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > f = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > g = datetime.datetime.now() > g-a > datetime.timedelta(microseconds=632116) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599194#comment-17599194 ] Apache Spark commented on SPARK-40311: -- User 'santosh-d3vpl3x' has created a pull request for this issue: https://github.com/apache/spark/pull/37761 > Introduce withColumnsRenamed > > > Key: SPARK-40311 > URL: https://issues.apache.org/jira/browse/SPARK-40311 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.2.2 >Reporter: Santosh Pingale >Priority: Minor > > Add a scala, pyspark, R dataframe API that can rename multiple columns in a > single command. This is mostly a performance related optimisations where > users iteratively perform `withColumnRenamed`. With 100s columns and multiple > iterations, there are cases where either driver will blow up or users will > receive a StackOverflowError. > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > b = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > c = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > d = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > e = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > f = datetime.datetime.now() > for col in raw.columns: > raw = raw.withColumnRenamed(col, f"prefix_{col}") > g = datetime.datetime.now() > g-a > datetime.timedelta(seconds=12, microseconds=480021) {code} > {code:java} > import datetime > import numpy as np > import pandas as pd > num_rows = 2 > num_columns = 100 > data = np.zeros((num_rows, num_columns)) > columns = map(str, range(num_columns)) > raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) > a = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > b = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > c = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > d = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > e = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > f = datetime.datetime.now() > raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in > raw.columns}), spark) > g = datetime.datetime.now() > g-a > datetime.timedelta(microseconds=632116) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40311) Introduce withColumnsRenamed
[ https://issues.apache.org/jira/browse/SPARK-40311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santosh Pingale updated SPARK-40311: Description: Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. Issues are faced when users iteratively perform `withColumnRenamed`. * When it works, we see slower performace * In some cases, StackOverflowError is raised due to logical plan being too big * In a few cases, driver died due to memory consumption Some reproducible benchmarks: {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) g = datetime.datetime.now() g-a datetime.timedelta(microseconds=632116) {code} was: Add a scala, pyspark, R dataframe API that can rename multiple columns in a single command. This is mostly a performance related optimisations where users iteratively perform `withColumnRenamed`. With 100s columns and multiple iterations, there are cases where either driver will blow up or users will receive a StackOverflowError. {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") b = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") c = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") d = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") e = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") f = datetime.datetime.now() for col in raw.columns: raw = raw.withColumnRenamed(col, f"prefix_{col}") g = datetime.datetime.now() g-a datetime.timedelta(seconds=12, microseconds=480021) {code} {code:java} import datetime import numpy as np import pandas as pd num_rows = 2 num_columns = 100 data = np.zeros((num_rows, num_columns)) columns = map(str, range(num_columns)) raw = spark.createDataFrame(pd.DataFrame(data, columns=columns)) a = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) b = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) c = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) d = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) e = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spark) f = datetime.datetime.now() raw = DataFrame(raw._jdf.withColumnsRenamed({col: f"prefix_{col}" for col in raw.columns}), spar
[jira] [Updated] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql
[ https://issues.apache.org/jira/browse/SPARK-40309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-40309: Labels: release-notes (was: ) > Introduce sql_conf context manager for pyspark.sql > -- > > Key: SPARK-40309 > URL: https://issues.apache.org/jira/browse/SPARK-40309 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > Labels: release-notes > > [https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490] > a context manager is introduced to set the Spark SQL configuration and > then restores it back when it exits, in Pandas API on Spark. > That simplifies the control of Spark SQL configuration, > from > {code:java} > original_value = spark.conf.get("key") > spark.conf.set("key", "value") > ... > spark.conf.set("key", original_value){code} > to > {code:java} > with sql_conf({"key": "value"}): > ... > {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40288: Assignee: (was: Apache Spark) > After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should > applied to avoid attribute missing when use complex expression. > -- > > Key: SPARK-40288 > URL: https://issues.apache.org/jira/browse/SPARK-40288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 > Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0 >Reporter: hgs >Priority: Minor > > {{--table}} > {{create}} {{table}} {{miss_expr(id }}{{{}int{}}}{{{},{}}}{{{}name{}}} > {{string,age }}{{{}double{}}}{{{}) stored {}}}{{as}} {{textfile}} > {{--data}} > {{insert}} {{overwrite }}{{table}} {{miss_expr > }}{{{}values{}}}{{{}(1,{}}}{{{}'ox'{}}}{{{},1.0),(1,{}}}{{{}'oox'{}}}{{{},2.0),(2,{}}}{{{}'ox'{}}}{{{},3.0),(2,{}}}{{{}'xxo'{}}}{{{},4.0){}}} > {{--failure sql}} > {{insert}} {{overwrite }}{{table}} {{miss_expr}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},nage {}}}{{as}} {{n > }}{{{}from{}}}{{{}({}}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},if(age>3,100,200) {}}}{{as}} {{nage > }}{{from}} {{miss_expr }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},age{}}} > {{) }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},nage{}}} > --error stack > {{Caused by: java.lang.IllegalStateException: Couldn't find age#4 in > [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12|#2,name#3,if ((age#4 > 3.0)) > 100 else 200#12]}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)}} > {{at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40288: Assignee: Apache Spark > After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should > applied to avoid attribute missing when use complex expression. > -- > > Key: SPARK-40288 > URL: https://issues.apache.org/jira/browse/SPARK-40288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 > Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0 >Reporter: hgs >Assignee: Apache Spark >Priority: Minor > > {{--table}} > {{create}} {{table}} {{miss_expr(id }}{{{}int{}}}{{{},{}}}{{{}name{}}} > {{string,age }}{{{}double{}}}{{{}) stored {}}}{{as}} {{textfile}} > {{--data}} > {{insert}} {{overwrite }}{{table}} {{miss_expr > }}{{{}values{}}}{{{}(1,{}}}{{{}'ox'{}}}{{{},1.0),(1,{}}}{{{}'oox'{}}}{{{},2.0),(2,{}}}{{{}'ox'{}}}{{{},3.0),(2,{}}}{{{}'xxo'{}}}{{{},4.0){}}} > {{--failure sql}} > {{insert}} {{overwrite }}{{table}} {{miss_expr}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},nage {}}}{{as}} {{n > }}{{{}from{}}}{{{}({}}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},if(age>3,100,200) {}}}{{as}} {{nage > }}{{from}} {{miss_expr }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},age{}}} > {{) }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},nage{}}} > --error stack > {{Caused by: java.lang.IllegalStateException: Couldn't find age#4 in > [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12|#2,name#3,if ((age#4 > 3.0)) > 100 else 200#12]}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)}} > {{at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599242#comment-17599242 ] Apache Spark commented on SPARK-40288: -- User 'hgs19921112' has created a pull request for this issue: https://github.com/apache/spark/pull/37765 > After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should > applied to avoid attribute missing when use complex expression. > -- > > Key: SPARK-40288 > URL: https://issues.apache.org/jira/browse/SPARK-40288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 > Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0 >Reporter: hgs >Priority: Minor > > {{--table}} > {{create}} {{table}} {{miss_expr(id }}{{{}int{}}}{{{},{}}}{{{}name{}}} > {{string,age }}{{{}double{}}}{{{}) stored {}}}{{as}} {{textfile}} > {{--data}} > {{insert}} {{overwrite }}{{table}} {{miss_expr > }}{{{}values{}}}{{{}(1,{}}}{{{}'ox'{}}}{{{},1.0),(1,{}}}{{{}'oox'{}}}{{{},2.0),(2,{}}}{{{}'ox'{}}}{{{},3.0),(2,{}}}{{{}'xxo'{}}}{{{},4.0){}}} > {{--failure sql}} > {{insert}} {{overwrite }}{{table}} {{miss_expr}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},nage {}}}{{as}} {{n > }}{{{}from{}}}{{{}({}}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},if(age>3,100,200) {}}}{{as}} {{nage > }}{{from}} {{miss_expr }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},age{}}} > {{) }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},nage{}}} > --error stack > {{Caused by: java.lang.IllegalStateException: Couldn't find age#4 in > [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12|#2,name#3,if ((age#4 > 3.0)) > 100 else 200#12]}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)}} > {{at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599243#comment-17599243 ] Apache Spark commented on SPARK-40288: -- User 'hgs19921112' has created a pull request for this issue: https://github.com/apache/spark/pull/37765 > After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should > applied to avoid attribute missing when use complex expression. > -- > > Key: SPARK-40288 > URL: https://issues.apache.org/jira/browse/SPARK-40288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 > Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0 >Reporter: hgs >Priority: Minor > > {{--table}} > {{create}} {{table}} {{miss_expr(id }}{{{}int{}}}{{{},{}}}{{{}name{}}} > {{string,age }}{{{}double{}}}{{{}) stored {}}}{{as}} {{textfile}} > {{--data}} > {{insert}} {{overwrite }}{{table}} {{miss_expr > }}{{{}values{}}}{{{}(1,{}}}{{{}'ox'{}}}{{{},1.0),(1,{}}}{{{}'oox'{}}}{{{},2.0),(2,{}}}{{{}'ox'{}}}{{{},3.0),(2,{}}}{{{}'xxo'{}}}{{{},4.0){}}} > {{--failure sql}} > {{insert}} {{overwrite }}{{table}} {{miss_expr}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},nage {}}}{{as}} {{n > }}{{{}from{}}}{{{}({}}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},if(age>3,100,200) {}}}{{as}} {{nage > }}{{from}} {{miss_expr }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},age{}}} > {{) }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},nage{}}} > --error stack > {{Caused by: java.lang.IllegalStateException: Couldn't find age#4 in > [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12|#2,name#3,if ((age#4 > 3.0)) > 100 else 200#12]}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)}} > {{at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599249#comment-17599249 ] Apache Spark commented on SPARK-40288: -- User 'hgs19921112' has created a pull request for this issue: https://github.com/apache/spark/pull/37766 > After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should > applied to avoid attribute missing when use complex expression. > -- > > Key: SPARK-40288 > URL: https://issues.apache.org/jira/browse/SPARK-40288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 > Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0 >Reporter: hgs >Priority: Minor > > {{--table}} > {{create}} {{table}} {{miss_expr(id }}{{{}int{}}}{{{},{}}}{{{}name{}}} > {{string,age }}{{{}double{}}}{{{}) stored {}}}{{as}} {{textfile}} > {{--data}} > {{insert}} {{overwrite }}{{table}} {{miss_expr > }}{{{}values{}}}{{{}(1,{}}}{{{}'ox'{}}}{{{},1.0),(1,{}}}{{{}'oox'{}}}{{{},2.0),(2,{}}}{{{}'ox'{}}}{{{},3.0),(2,{}}}{{{}'xxo'{}}}{{{},4.0){}}} > {{--failure sql}} > {{insert}} {{overwrite }}{{table}} {{miss_expr}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},nage {}}}{{as}} {{n > }}{{{}from{}}}{{{}({}}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},if(age>3,100,200) {}}}{{as}} {{nage > }}{{from}} {{miss_expr }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},age{}}} > {{) }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},nage{}}} > --error stack > {{Caused by: java.lang.IllegalStateException: Couldn't find age#4 in > [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12|#2,name#3,if ((age#4 > 3.0)) > 100 else 200#12]}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)}} > {{at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40288) After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should applied to avoid attribute missing when use complex expression.
[ https://issues.apache.org/jira/browse/SPARK-40288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599251#comment-17599251 ] Apache Spark commented on SPARK-40288: -- User 'hgs19921112' has created a pull request for this issue: https://github.com/apache/spark/pull/37766 > After `RemoveRedundantAggregates`, `PullOutGroupingExpressions` should > applied to avoid attribute missing when use complex expression. > -- > > Key: SPARK-40288 > URL: https://issues.apache.org/jira/browse/SPARK-40288 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.3.0 > Environment: spark 3.2.0 spark 3.2.2 spark 3.3.0 >Reporter: hgs >Priority: Minor > > {{--table}} > {{create}} {{table}} {{miss_expr(id }}{{{}int{}}}{{{},{}}}{{{}name{}}} > {{string,age }}{{{}double{}}}{{{}) stored {}}}{{as}} {{textfile}} > {{--data}} > {{insert}} {{overwrite }}{{table}} {{miss_expr > }}{{{}values{}}}{{{}(1,{}}}{{{}'ox'{}}}{{{},1.0),(1,{}}}{{{}'oox'{}}}{{{},2.0),(2,{}}}{{{}'ox'{}}}{{{},3.0),(2,{}}}{{{}'xxo'{}}}{{{},4.0){}}} > {{--failure sql}} > {{insert}} {{overwrite }}{{table}} {{miss_expr}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},nage {}}}{{as}} {{n > }}{{{}from{}}}{{{}({}}} > {{select}} {{{}id,{}}}{{{}name{}}}{{{},if(age>3,100,200) {}}}{{as}} {{nage > }}{{from}} {{miss_expr }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},age{}}} > {{) }}{{group}} {{by}} {{{}id,{}}}{{{}name{}}}{{{},nage{}}} > --error stack > {{Caused by: java.lang.IllegalStateException: Couldn't find age#4 in > [id#2,name#3,if ((age#4 > 3.0)) 100 else 200#12|#2,name#3,if ((age#4 > 3.0)) > 100 else 200#12]}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)}} > {{at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)}} > {{at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254 ] Yang Jie commented on SPARK-40303: -- Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254 ] Yang Jie edited comment on SPARK-40303 at 9/2/22 4:49 AM: -- Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 2? was (Author: luciferyang): Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39284) Implement Groupby.mad
[ https://issues.apache.org/jira/browse/SPARK-39284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599258#comment-17599258 ] Apache Spark commented on SPARK-39284: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/37767 > Implement Groupby.mad > - > > Key: SPARK-39284 > URL: https://issues.apache.org/jira/browse/SPARK-39284 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254 ] Yang Jie edited comment on SPARK-40303 at 9/2/22 4:56 AM: -- Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 2 or give up? was (Author: luciferyang): Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 2? > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254 ] Yang Jie edited comment on SPARK-40303 at 9/2/22 4:58 AM: -- Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 2 or give up when use Java 8? was (Author: luciferyang): Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 2 or give up? > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599254#comment-17599254 ] Yang Jie edited comment on SPARK-40303 at 9/2/22 4:59 AM: -- Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 3 or give up when use Java 8? was (Author: luciferyang): Run use Java 8 with `-XX:+PrintCompilation`, I found the following logs: {code:java} 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} [~rednaxelafx] In this case, will compilation optimization degenerate to Level 2 or give up when use Java 8? > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599262#comment-17599262 ] Yang Jie commented on SPARK-40303: -- {code:java} 64265 21820 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 64308 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) {code} {code:java} 103941 22065 % 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 104602 22067 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ (2712 bytes) 135979 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} The compilation relevant logs are as above when Java 8 is used > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8 1.0X > 60 count distinct without codegen147635 147635 >0 0.0 73817.5 4.3X > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599263#comment-17599263 ] Yang Jie commented on SPARK-40303: -- If you run with Java 17, the performance gap will be smaller. The compilation logs of `hashAgg_doConsume_0` and `hashAgg_doAggregateWithKeys_0` as follows: {code:java} 102158 22568 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 102180 22606 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 102228 22606 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (retry at different tier) 102228 22619 1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 102240 22568 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) made not entrant 218296 24067 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2::hashAgg_doConsume_0$ (2052 bytes) 218463 22568 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) made zombie {code} {code:java} 105832 22708 % 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 105955 22709 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ (2712 bytes) 108247 22741 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 108484 22741 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (retry at different tier) 108727 22708 % 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) made not entrant 108727 22743 % 1 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 218463 22708 % 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) made zombie {code} > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > 60 count distinct with codegen 628146 628146 >0 0.0 314072.8
[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen
[ https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17599262#comment-17599262 ] Yang Jie edited comment on SPARK-40303 at 9/2/22 5:15 AM: -- {code:java} 64265 21820 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 64308 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) 1735350 24101 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2::hashAgg_doConsume_0$ (2052 bytes){code} {code:java} 103941 22065 % 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 104602 22067 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ (2712 bytes) 135979 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} The compilation relevant logs are as above when Java 8 is used was (Author: luciferyang): {code:java} 64265 21820 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 64308 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) 64350 21862 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doConsume_0$ (2051 bytes) COMPILE SKIPPED: unsupported incoming calling sequence (not retryable) {code} {code:java} 103941 22065 % 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 104602 22067 3 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ (2712 bytes) 135979 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) 137245 22141 % 4 org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1::hashAgg_doAggregateWithKeys_0$ @ 38 (2712 bytes) COMPILE SKIPPED: unsupported calling sequence (not retryable) {code} The compilation relevant logs are as above when Java 8 is used > The performance will be worse after codegen > --- > > Key: SPARK-40303 > URL: https://issues.apache.org/jira/browse/SPARK-40303 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import org.apache.spark.benchmark.Benchmark > val dir = "/tmp/spark/benchmark" > val N = 200 > val columns = Range(0, 100).map(i => s"id % $i AS id$i") > spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir) > // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100) > Seq(60).foreach{ cnt => > val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => > s"count(distinct $c)") > val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = > 1) > benchmark.addCase(s"$cnt count distinct with codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "true", > "spark.sql.codegen.factoryMode" -> "FALLBACK") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.addCase(s"$cnt count distinct without codegen") { _ => > withSQLConf( > "spark.sql.codegen.wholeStage" -> "false", > "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") { > spark.read.parquet(dir).selectExpr(selectExps: > _*).write.format("noop").mode("Overwrite").save() > } > } > benchmark.run() > } > {code} > {noformat} > Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7 > Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz > Benchmark count distinct: Best Time(ms) Avg Time(m