Re: [PR] [SPARK-42669][CONNECT] Short circuit local relation RPCs [spark]
beliefer commented on PR #40782: URL: https://github.com/apache/spark/pull/40782#issuecomment-1762631600 @hvanhovell Do we need this change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-42669][CONNECT] Short circuit local relation RPCs [spark]
beliefer opened a new pull request, #40782: URL: https://github.com/apache/spark/pull/40782 ### What changes were proposed in this pull request? Operations on `LocalRelation` can mostly be done locally (without sending RPCs). We should leverage this. ### Why are the changes needed? Avoid sending RPCs for `LocalRelation`. ### Does this PR introduce _any_ user-facing change? 'No'. New feature. ### How was this patch tested? Exists test cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]
beliefer commented on code in PR #42398: URL: https://github.com/apache/spark/pull/42398#discussion_r1359157402 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ListAgg.scala: ## @@ -0,0 +1,109 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions.aggregate + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.trees.UnaryLike +import org.apache.spark.sql.catalyst.types.PhysicalDataType +import org.apache.spark.sql.types.{DataType, StringType} +import org.apache.spark.unsafe.types.UTF8String +import org.apache.spark.util.collection.OpenHashMap + +@ExpressionDescription( + usage = "_FUNC_(expr) - Returns the concatenated input values," + +" separated by the delimiter string.", + examples = """ +Examples: + > SELECT _FUNC_(col) FROM VALUES ('a'), ('b'), ('c') AS tab(col); + a,b,c + > SELECT _FUNC_(col) FROM VALUES (NULL), ('a'), ('b') AS tab(col); + a,b + > SELECT _FUNC_(col) FROM VALUES ('a'), ('a') AS tab(col); + a,a + > SELECT _FUNC_(DISTINCT col) FROM VALUES ('a'), ('a'), ('b') AS tab(col); + a,b + > SELECT _FUNC_(col, '|') FROM VALUES ('a'), ('b') AS tab(col); + a|b + > SELECT _FUNC_(col) FROM VALUES (NULL), (NULL) AS tab(col); + NULL + """, + group = "agg_funcs", + since = "4.0.0") +case class ListAgg( +child: Expression, +delimiter: Expression = Literal.create(",", StringType), +reverse: Boolean = false, +mutableAggBufferOffset: Int = 0, +inputAggBufferOffset: Int = 0) extends TypedAggregateWithHashMapAsBuffer + with UnaryLike[Expression] { + + def this(child: Expression) = this(child, Literal.create(",", StringType), false, 0, 0) + def this(child: Expression, delimiter: Expression) = this(child, delimiter, false, 0, 0) + + override def update( + buffer: OpenHashMap[AnyRef, Long], Review Comment: You can composite them. Please refer `PercentileCont` or `RegrSlope`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44837][SQL] Improve ALTER TABLE ALTER PARTITION column error message [spark]
beliefer commented on code in PR #42524: URL: https://github.com/apache/spark/pull/42524#discussion_r1359145115 ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -377,8 +377,13 @@ case class AlterTableChangeColumnCommand( val resolver = sparkSession.sessionState.conf.resolver DDLUtils.verifyAlterTableType(catalog, table, isView = false) -// Find the origin column from dataSchema by column name. +// check that the column is not a partition column Review Comment: `check` -> `Check` ## sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala: ## @@ -377,8 +377,13 @@ case class AlterTableChangeColumnCommand( val resolver = sparkSession.sessionState.conf.resolver DDLUtils.verifyAlterTableType(catalog, table, isView = false) -// Find the origin column from dataSchema by column name. +// check that the column is not a partition column +if (findColumnByName(table.partitionSchema, columnName, resolver).isDefined) { Review Comment: How about `table.partitionSchema.fieldNames.exists(resolver(_, columnName))` ? So we no need change `findColumnByName` and related code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]
beliefer commented on PR #42317: URL: https://github.com/apache/spark/pull/42317#issuecomment-1762596244 The GA failure is unrelated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45531][SQL][DOCS] Add more comments and rename some variable name for InjectRuntimeFilter [spark]
beliefer commented on PR #43359: URL: https://github.com/apache/spark/pull/43359#issuecomment-1762593379 > what variables did you rename? Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45530][CORE] Use `java.lang.ref.Cleaner` instead of `finalize` for `NioBufferedFileInputStream` [spark]
mridulm commented on PR #4: URL: https://github.com/apache/spark/pull/4#issuecomment-1762592047 Looks good to me, thanks for fixing this @LuciferYang ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44307][SQL] Add Bloom filter for left outer join even if the left side table is smaller than broadcast threshold. [spark]
beliefer commented on code in PR #41860: URL: https://github.com/apache/spark/pull/41860#discussion_r1359115220 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala: ## @@ -322,9 +332,6 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with PredicateHelper with J isSimpleExpression(l) && isSimpleExpression(r)) { val oldLeft = newLeft val oldRight = newRight -// Check if the current join is a shuffle join or a broadcast join that -// has a shuffle below it -val hasShuffle = isProbablyShuffleJoin(left, right, hint) Review Comment: Shall we not move this line? Although move it to line 323 is also correct. We should do it lazily. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [MINOR][CORE] Add `@Deprecated` for `SparkLauncher#DEPRECATED_CHILD_CONNECTION_TIMEOUT` [spark]
LuciferYang opened a new pull request, #43374: URL: https://github.com/apache/spark/pull/43374 ### What changes were proposed in this pull request? This pr just add `@Deprecated` for `SparkLauncher#DEPRECATED_CHILD_CONNECTION_TIMEOUT`. ### Why are the changes needed? From the javadoc, `DEPRECATED_CHILD_CONNECTION_TIMEOUT` has been deprecated, so it should carry a `@Deprecated` marker. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]
beliefer commented on code in PR #43371: URL: https://github.com/apache/spark/pull/43371#discussion_r1359110742 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -1933,6 +1933,39 @@ public RoaringBitmap getMapTracker() { int getNumIOExceptions() { return numIOExceptions; } + +private record ResourceCleaner( +File dataFile, +FileChannel dataChannel, +MergeShuffleFile indexFile, +MergeShuffleFile metaFile, +AppAttemptShuffleMergeId appAttemptShuffleMergeId, +int reduceId) implements Runnable { + + @Override + public void run() { +try { Review Comment: +1 reuse `AppShufflePartitionInfo#closeAllFilesAndDeleteIfNeeded`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw [spark]
beliefer commented on PR #43330: URL: https://github.com/apache/spark/pull/43330#issuecomment-1762547574 cc @dongjoon-hyun @wangyum -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw [spark]
beliefer commented on PR #43330: URL: https://github.com/apache/spark/pull/43330#issuecomment-1762547364 The GA failure is unrelated to this PR. `Linters, licenses, dependencies and documentation generation` ``` /usr/lib/ruby/2.7.0/fileutils.rb:1415:in `copy_stream': No space left on device - copy_file_range (Errno::ENOSPC) [12872](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12872) from /usr/lib/ruby/2.7.0/fileutils.rb:1415:in `block (2 levels) in copy_file' [12873](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12873) from /usr/lib/ruby/2.7.0/fileutils.rb:1414:in `open' [12874](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12874) from /usr/lib/ruby/2.7.0/fileutils.rb:1414:in `block in copy_file' [12875](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12875) from /usr/lib/ruby/2.7.0/fileutils.rb:1413:in `open' [12876](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12876) from /usr/lib/ruby/2.7.0/fileutils.rb:1413:in `copy_file' [12877](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12877) from /usr/lib/ruby/2.7.0/fileutils.rb:1378:in `copy' [12879](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12879) Error: No space left on device : '/home/runner/runners/2.309.0/_diag/pages/b7db6e67-b9c1-4011-a807-e832ba0cf437_fef911b0-771d-54c9-1f3c-e23e2c04b8fb_1.log' ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45427][CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf [spark]
hasnain-db commented on PR #43238: URL: https://github.com/apache/spark/pull/43238#issuecomment-1762546624 `org.apache.spark.sql.kafka010.KafkaSourceStressSuite` is failing on this and the other outstanding PR, and seems to be unrelated to the changes here. I wonder if that is related to the new kafka version bump since it looks like it also failed on https://github.com/apache/spark/actions/runs/6497942586/job/17648192318 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45429][CORE] Add helper classes for SSL RPC communication [spark]
hasnain-db commented on PR #43244: URL: https://github.com/apache/spark/pull/43244#issuecomment-1762546614 `org.apache.spark.sql.kafka010.KafkaSourceStressSuite` is failing on this and the other outstanding PR, and seems to be unrelated to the changes here. I wonder if that is related to the new kafka version bump since it looks like it also failed on https://github.com/apache/spark/actions/runs/6497942586/job/17648192318 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` [spark]
beliefer commented on PR #43351: URL: https://github.com/apache/spark/pull/43351#issuecomment-1762545714 @LuciferYang @srowen Thanks, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45514][SQL][MLLIB] Replace `scala.runtime.Tuple3Zipped` to `scala.collection.LazyZip3` [spark]
beliefer commented on PR #43363: URL: https://github.com/apache/spark/pull/43363#issuecomment-1762545286 The GA failure is unrelated to this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45514][SQL][MLLIB] Replace `scala.runtime.Tuple3Zipped` to `scala.collection.LazyZip3` [spark]
beliefer commented on PR #43363: URL: https://github.com/apache/spark/pull/43363#issuecomment-1762545070 cc @HyukjinKwon @srowen -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` [spark]
beliefer closed pull request #43351: [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` URL: https://github.com/apache/spark/pull/43351 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44527][SQL] Replace ScalarSubquery with null if its maxRows is 0 [spark]
wangyum commented on code in PR #42129: URL: https://github.com/apache/spark/pull/42129#discussion_r1359102724 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala: ## @@ -88,6 +88,10 @@ object ConstantFolding extends Rule[LogicalPlan] { e } +// Replace ScalarSubquery with null if its maxRows is 0 +case s: ScalarSubquery if s.plan.maxRows.contains(0) => Review Comment: @jchen5 Do we need to consider `mayHaveCountBug`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44210][CONNECT][SQL][PYTHON] Strengthen type checking and better comply with Connect specifications for `levenshtein` function [spark]
LuciferYang commented on PR #41724: URL: https://github.com/apache/spark/pull/41724#issuecomment-1762543078 Don't we need this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45530][CORE] Use `java.lang.ref.Cleaner` instead of `finalize` for `NioBufferedFileInputStream` [spark]
LuciferYang commented on code in PR #4: URL: https://github.com/apache/spark/pull/4#discussion_r1359102252 ## core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java: ## @@ -32,8 +34,11 @@ */ public final class NioBufferedFileInputStream extends InputStream { + private static final Cleaner CLEANER = Cleaner.create(); Review Comment: [a9f3025](https://github.com/apache/spark/pull/4/commits/a9f3025afb2381f573ee0c3e146da9d0b704593b) rename this from `cleaner` to `CLEANER` due to it's static -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]
ahshahid opened a new pull request, #43373: URL: https://github.com/apache/spark/pull/43373 What changes were proposed in this pull request? On the lines of DPP which helps DataSourceV2 relations when the joining key is a partition column, the same concept can be extended over to the case where joining key is not a partition column. In this PR, the keys available in the BroadcastHashJoinExec are pushed down to the DataSourceV2 scans in form of a SortedSet structure. For non partition columns, the DataSources like iceberg have max/min stats on columns available at manifest level, and for formats like parquet , they have max/min stats at various data structure levels. The passed SortedSet can be used to prune using ranges at both driver level ( manifests files) as well as executor level ( while actually going through chunks , row groups etc at parquet level) If the data is stored as Columnar Batch format , then it would not be possible to filter out individual row at DataSource level, even though we have keys. But at the scan level, ( ColumnToRowExec) it is still possible to filter out as many rows as possible , if the query involves nested joins. Thus reducing the number of rows to join at the higher join levels. Attaching link to a presentation which outlines the idea: [Broadcast Keys pushdown](https://docs.google.com/presentation/d/165Rx7i00TmAKnDJpSQLfrcrW-ShrzPy5/edit?usp=drive_link) SPIP : [SPIP-44662](https://issues.apache.org/jira/browse/SPARK-44662) Why are the changes needed? There is scope of improvement in the performance of Inner and Left Semi join queries when using BroadcastHashJoin Does this PR introduce any user-facing change? No How was this patch tested? Ran TPCDS suite using iceberg as DataSource. Converted many of the existing Spark Query tests to also run using iceberg as data source. Will be adding more unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]
LuciferYang commented on code in PR #43371: URL: https://github.com/apache/spark/pull/43371#discussion_r1359100137 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -1933,6 +1933,39 @@ public RoaringBitmap getMapTracker() { int getNumIOExceptions() { return numIOExceptions; } + +private record ResourceCleaner( +File dataFile, +FileChannel dataChannel, +MergeShuffleFile indexFile, +MergeShuffleFile metaFile, +AppAttemptShuffleMergeId appAttemptShuffleMergeId, +int reduceId) implements Runnable { + + @Override + public void run() { +try { Review Comment: My only question is that this introduces some duplicate code with the `AppShufflePartitionInfo#closeAllFilesAndDeleteIfNeeded` method. Is there really no way to eliminate this duplication? ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -1756,6 +1759,8 @@ public static class AppShufflePartitionInfo { this.dataFilePos = 0; this.mapTracker = new RoaringBitmap(); this.chunkTracker = new RoaringBitmap(); + CLEANER.register(this, new ResourceCleaner(dataFile, dataChannel, indexFile, + metaFile, appAttemptShuffleMergeId, reduceId)); Review Comment: Indentation: 2 spaces -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]
Yikun commented on code in PR #52: URL: https://github.com/apache/spark-docker/pull/52#discussion_r1359079892 ## versions.json: ## @@ -40,18 +40,41 @@ "3.4.1" ] }, +{ + "path": "3.4.1/scala2.13-java11-python3-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-python3-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-r-ubuntu", "tags": [ "3.4.1-scala2.12-java11-r-ubuntu", "3.4.1-r" ] }, +{ + "path": "3.4.1/scala2.13-java11-r-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-r-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-ubuntu", "tags": [ "3.4.1-scala2.12-java11-ubuntu", -"3.4.1-scala" +"3.4.1-scala2.12", +"scala2.12", +"3.4.1-scala", +"scala" + ] +}, +{ + "path": "3.4.1/scala2.13-java11-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-ubuntu", +"3.4.1-scala2.13", +"scala2.13" Review Comment: Yes, agree! As I mentioned above comment, 2.13 better to introduce in latest version (3.5.0 for now). https://github.com/apache/spark-docker/pull/52#pullrequestreview-1676547189 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]
Yikun commented on code in PR #52: URL: https://github.com/apache/spark-docker/pull/52#discussion_r1359079892 ## versions.json: ## @@ -40,18 +40,41 @@ "3.4.1" ] }, +{ + "path": "3.4.1/scala2.13-java11-python3-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-python3-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-r-ubuntu", "tags": [ "3.4.1-scala2.12-java11-r-ubuntu", "3.4.1-r" ] }, +{ + "path": "3.4.1/scala2.13-java11-r-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-r-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-ubuntu", "tags": [ "3.4.1-scala2.12-java11-ubuntu", -"3.4.1-scala" +"3.4.1-scala2.12", +"scala2.12", +"3.4.1-scala", +"scala" + ] +}, +{ + "path": "3.4.1/scala2.13-java11-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-ubuntu", +"3.4.1-scala2.13", +"scala2.13" Review Comment: Yes, agree! As I mentioned above comment, 2.13 better to introduce in latest versio (3.5.0 for now) https://github.com/apache/spark-docker/pull/52#pullrequestreview-1676547189 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45530][CORE] Use `java.lang.ref.Cleaner` instead of `finalize` for `NioBufferedFileInputStream` [spark]
LuciferYang commented on PR #4: URL: https://github.com/apache/spark/pull/4#issuecomment-1762516078 ``` [info] *** 1 TEST FAILED *** [error] Failed tests: [error] org.apache.spark.sql.kafka010.KafkaSourceStressSuite [warn] In the last 2048 seconds, 6.951 (0.3%) were spent in GC. [Heap: 3.23GB free of 4.00GB, max 4.00GB] Consider increasing the JVM heap using `-Xmx` or try a different collector, e.g. `-XX:+UseG1GC`, for better performance. [error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 2049 s (34:09), completed Oct 13, 2023, 4:05:45 PM ``` This is a case that started failing after upgrading to Kafka 3.6, the author is fixing it, it's unrelated to this PR. Could you take another look at this pr If you have time, thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Revert "[SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g" [spark]
LuciferYang commented on PR #43372: URL: https://github.com/apache/spark/pull/43372#issuecomment-1762514978 cc @HyukjinKwon @dongjoon-hyun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] Revert "[SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g" [spark]
LuciferYang opened a new pull request, #43372: URL: https://github.com/apache/spark/pull/43372 This reverts commit 3e2470de7ea8b97dcdd8875ef25f044998fb7588. ### What changes were proposed in this pull request? This pr revert change of https://github.com/apache/spark/pull/43364. ### Why are the changes needed? It seems to have no effect on fixing `Publish snapshot`, it still failed - https://github.com/apache/spark/actions/runs/6514229181/job/17696846279 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]
LuciferYang commented on PR #43364: URL: https://github.com/apache/spark/pull/43364#issuecomment-1762513910 @dongjoon-hyun It seems to still not work. Do you have any ideas or suggestions for optimizing the compilation options? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45514][SQL][MLLIB] Replace `scala.runtime.Tuple3Zipped` to `scala.collection.LazyZip3` [spark]
beliefer commented on PR #43363: URL: https://github.com/apache/spark/pull/43363#issuecomment-1762510717 ping @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` [spark]
beliefer commented on PR #43351: URL: https://github.com/apache/spark/pull/43351#issuecomment-1762508892 @LuciferYang @srowen The GA failure is unrelated to this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]
LuciferYang commented on PR #43371: URL: https://github.com/apache/spark/pull/43371#issuecomment-1762504715 cc @mridulm FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]
zhaomin1423 commented on PR #43371: URL: https://github.com/apache/spark/pull/43371#issuecomment-1762485442 Please help me review it. @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-455315] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]
zhaomin1423 opened a new pull request, #43371: URL: https://github.com/apache/spark/pull/43371 ### What changes were proposed in this pull request? use java.lang.ref.Cleaner instead of finalize() for RemoteBlockPushResolver ### Why are the changes needed? The finalize() method has been marked as deprecated since Java 9 and will be removed in the future, java.lang.ref.Cleaner is the more recommended solution. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]
holdenk commented on code in PR #52: URL: https://github.com/apache/spark-docker/pull/52#discussion_r1359026750 ## versions.json: ## @@ -40,18 +40,41 @@ "3.4.1" ] }, +{ + "path": "3.4.1/scala2.13-java11-python3-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-python3-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-r-ubuntu", "tags": [ "3.4.1-scala2.12-java11-r-ubuntu", "3.4.1-r" ] }, +{ + "path": "3.4.1/scala2.13-java11-r-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-r-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-ubuntu", "tags": [ "3.4.1-scala2.12-java11-ubuntu", -"3.4.1-scala" +"3.4.1-scala2.12", +"scala2.12", +"3.4.1-scala", +"scala" + ] +}, +{ + "path": "3.4.1/scala2.13-java11-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-ubuntu", +"3.4.1-scala2.13", +"scala2.13" Review Comment: Would we want to do the scala 2.13 tag with 3.5 instead @Yikun -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45539][SS] Add assert and log to indicate watermark definition is required for streaming aggregation queries in append mode [spark]
anishshri-db commented on PR #43370: URL: https://github.com/apache/spark/pull/43370#issuecomment-1762462455 cc - @HeartSaVioR - PTAL, thx -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44735][SQL] Add warning msg when inserting columns with the same name by row that don't match up [spark]
holdenk commented on PR #42763: URL: https://github.com/apache/spark/pull/42763#issuecomment-1762461979 Merged to the current branch so should be in 4 :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45539] Add assert and log to indicate watermark definition is required for streaming aggregation queries in append mode [spark]
anishshri-db opened a new pull request, #43370: URL: https://github.com/apache/spark/pull/43370 ### What changes were proposed in this pull request? Add assert and log to indicate watermark definition is required for streaming aggregation queries in append mode ### Why are the changes needed? We have a check for ensuring that watermark attributes are specified in append mode based on the UnsupportedOperationChecker. However, in some cases we got report where user hit this stack trace: ``` org.apache.spark.SparkException: Exception thrown in awaitResult: Job aborted due to stage failure: Task 3 in stage 32.0 failed 4 times, most recent failure: Lost task 3.3 in stage 32.0 (TID 606) (10.5.71.29 executor 0): java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:472) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:708) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:145) at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs$(statefulOperators.scala:145) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.timeTakenMs(statefulOperators.scala:414) at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$5(statefulOperators.scala:470) at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps.$anonfun$mapPartitionsWithStateStore$1(package.scala:63) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:127) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406) ``` In this case, the reason for failure is not immediately clear. Hence adding an assert and log message to indicate why the query failed on the executor. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests ### Was this patch authored or co-authored using generative AI tooling? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44735][SQL] Add warning msg when inserting columns with the same name by row that don't match up [spark]
holdenk commented on code in PR #42763: URL: https://github.com/apache/spark/pull/42763#discussion_r1359025079 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala: ## @@ -455,6 +457,19 @@ object TableOutputResolver { } } + def suitableForByNameCheck( + byName: Boolean, + expected: Seq[Attribute], + queryOutput: Seq[Attribute]): Unit = { +if (!byName && expected.size == queryOutput.size && + expected.forall(e => queryOutput.exists(p => conf.resolver(p.name, e.name))) && + expected.zip(queryOutput).exists(e => !conf.resolver(e._1.name, e._2.name))) { + logWarning("The query columns and the table columns have same names but different " + +"orders. You can use INSERT [INTO | OVERWRITE] BY NAME to reorder the query columns to " + +"align with the table columns.") Review Comment: Right but it's a warning message so if a user is doing it intentionally they can ignore the warning. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43396][CORE] Add config to control max ratio of decommissioning executors [spark]
holdenk commented on PR #41076: URL: https://github.com/apache/spark/pull/41076#issuecomment-1762460656 Looks like it's failing from scala style checks, can you fix the style issues when you get a chance? Thanks :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType [spark]
JoshRosen commented on PR #43250: URL: https://github.com/apache/spark/pull/43250#issuecomment-1762420388 Hmm, it looks like the `OracleIntegrationSuite` tests are flaky but I don't think that's related to this PR's changes: ``` [info] OracleIntegrationSuite: [info] org.apache.spark.sql.jdbc.OracleIntegrationSuite *** ABORTED *** (7 minutes, 38 seconds) [info] The code passed to eventually never returned normally. Attempted 429 times over 7.00309507997 minutes. Last failure message: ORA-12514: Cannot connect to database. Service freepdb1 is not registered with the listener at host 10.1.0.126 port 45139. (CONNECTION_ID=CC2wkBm6SPGoMF7IghzCeQ==). (DockerJDBCIntegrationSuite.scala:166) [info] org.scalatest.exceptions.TestFailedDueToTimeoutException: [info] at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219) [info] at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226) [info] at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313) [info] at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312) [info] at org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]
agubichev commented on code in PR #43111: URL: https://github.com/apache/spark/pull/43111#discussion_r1358990343 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5272,6 +5281,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf { def decorrelateInnerQueryEnabled: Boolean = getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED) + def decorrelateInnerQueryEnabledForExistsIn: Boolean = + !getConf(SQLConf.DECORRELATE_EXISTS_IN_SUBQUERY_LEGACY_INCORRECT_COUNT_HANDLING_ENABLED) Review Comment: the caller checks it: https://github.com/search?q=repo%3Aapache%2Fspark%20decorrelateInnerQueryEnabledForExistsIn=code (first check of the `decorrelate` function, explicit check in CheckAnalysis) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45523][Python] Return useful error message if UDTF returns None for non-nullable column [spark]
dtenedor commented on code in PR #43356: URL: https://github.com/apache/spark/pull/43356#discussion_r1358990069 ## python/pyspark/worker.py: ## @@ -841,6 +841,27 @@ def _remove_partition_by_exprs(self, arg: Any) -> Any: "the query again." ) +# Compute the set of UDTF result columns whose types are not nullable. +# We will check that the UDTF does not return None values for these columns below. +non_nullable_result_cols = set() +for i, field in enumerate(return_type): Review Comment: Thanks for your reviews! I updated this PR to also check recursively for `None` values within array, struct, and map values as well. We can certainly consider other types of UDFs as well later (including Scala UDFs); it seems possible to decouple that work from Python UDTFs here, so I'll leave this PR to focus on the latter for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44210][CONNECT][SQL][PYTHON] Strengthen type checking and better comply with Connect specifications for `levenshtein` function [spark]
github-actions[bot] commented on PR #41724: URL: https://github.com/apache/spark/pull/41724#issuecomment-1762416219 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Move SparkThrowableSuite to spark-common-utils [spark]
github-actions[bot] commented on PR #41851: URL: https://github.com/apache/spark/pull/41851#issuecomment-1762416185 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-43879][CONNECT] Decouple handle command and send response on server side [spark]
github-actions[bot] commented on PR #41527: URL: https://github.com/apache/spark/pull/41527#issuecomment-1762416232 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] upgrade github action [spark]
github-actions[bot] closed pull request #41828: upgrade github action URL: https://github.com/apache/spark/pull/41828 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [MINOR] Eliminate maven build warnings: Using platform locale (en actually) to format date/time, i.e. build is platform dependent! [spark]
github-actions[bot] commented on PR #41842: URL: https://github.com/apache/spark/pull/41842#issuecomment-1762416203 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-34612][CORE] Make outputDeterministicLevel a public API [spark]
github-actions[bot] closed pull request #41853: [SPARK-34612][CORE] Make outputDeterministicLevel a public API URL: https://github.com/apache/spark/pull/41853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]
andylam-db commented on code in PR #43111: URL: https://github.com/apache/spark/pull/43111#discussion_r1358985912 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -5272,6 +5281,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf { def decorrelateInnerQueryEnabled: Boolean = getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED) + def decorrelateInnerQueryEnabledForExistsIn: Boolean = + !getConf(SQLConf.DECORRELATE_EXISTS_IN_SUBQUERY_LEGACY_INCORRECT_COUNT_HANDLING_ENABLED) Review Comment: Should we check whether `decorrelateInnerQueryEnabled` is true here, too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]
siying commented on PR #43338: URL: https://github.com/apache/spark/pull/43338#issuecomment-1762300061 @dongjoon-hyun here it is: https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#compression . I'm updating the description. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45427][CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf [spark]
hasnain-db commented on PR #43238: URL: https://github.com/apache/spark/pull/43238#issuecomment-1762251118 test failure looks unrelated, retrying for the third time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45524][PYTHON][SQL] Initial support for Python data source read API [spark]
ueshin commented on code in PR #43360: URL: https://github.com/apache/spark/pull/43360#discussion_r1358824516 ## sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonDataSource.scala: ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.python + +import java.io.{DataInputStream, DataOutputStream} + +import scala.collection.mutable.ArrayBuffer + +import net.razorvine.pickle.Pickler + +import org.apache.spark.api.python.{PythonFunction, PythonWorkerUtils, SpecialLengths} +import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} +import org.apache.spark.sql.catalyst.plans.logical.PythonDataSource +import org.apache.spark.sql.catalyst.types.DataTypeUtils.toAttributes +import org.apache.spark.sql.errors.QueryCompilationErrors +import org.apache.spark.sql.types.StructType + +/** + * A user-defined Python data source. This is used by the Python API. + */ +case class UserDefinedPythonDataSource( +dataSource: PythonFunction, +schema: StructType) { + def apply(session: SparkSession): DataFrame = { +val source = PythonDataSource(dataSource, schema, output = toAttributes(schema)) +Dataset.ofRows(session, source) + } +} + +case class PythonDataSourceReadInfo( +func: Array[Byte], +partitions: Seq[Array[Byte]]) + +class UserDefinedPythonDataSourceReadRunner( +func: PythonFunction, +schema: StructType) extends PythonPlannerRunner[PythonDataSourceReadInfo](func) { + + override val workerModule = "pyspark.sql.worker.plan_data_source_read" + + override protected def writeToPython(dataOut: DataOutputStream, pickler: Pickler): Unit = { +// Send Python data source +PythonWorkerUtils.writePythonFunction(func, dataOut) + +// Send schema +PythonWorkerUtils.writeUTF(schema.json, dataOut) + } + + override protected def receiveFromPython(dataIn: DataInputStream): PythonDataSourceReadInfo = { +// Receive the picked reader or an exception raised in Python worker. +val length = dataIn.readInt() +if (length == SpecialLengths.PYTHON_EXCEPTION_THROWN) { + val msg = PythonWorkerUtils.readUTF(dataIn) + throw QueryCompilationErrors.tableValuedFunctionFailedToAnalyseInPythonError(msg) +} + +// Receive the pickled 'read' function. +val pickledFunction: Array[Byte] = { + val obj = new Array[Byte](length) + dataIn.readFully(obj) + obj +} Review Comment: ```suggestion val pickledFunction = PythonWorkerUtils.readBytes(length, dataIn) ``` ## python/pyspark/sql/readwriter.py: ## @@ -303,6 +329,50 @@ def load( if schema is not None: self.schema(schema) self.options(**options) + +# Load a Python data source +if isinstance(self._format, Callable): +# TODO: support path in options. + +# Create an instance of the data source. +data_source_cls = cast(Type[DataSource], self._format) +data_source = data_source_cls(self._options) + +# Get schema of the data source +schema = self._schema or data_source.schema() +if isinstance(schema, str): +schema = _parse_datatype_string(schema) +# Check if the schema is a valid StructType. +if not isinstance(schema, StructType): +raise PySparkTypeError( +error_class="NOT_STR_OR_STRUCT", +message_parameters={ +"arg_name": "schema", +"arg_type": type(schema).__name__, +}, +) + +jschema = self._spark._jsparkSession.parseDataType(schema.json()) +sc = self._spark._sc +pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD( +sc, data_source +) +assert sc._jvm is not None +func = sc._jvm.SimplePythonFunction( +bytearray(pickled_command), +env, +includes, +sc.pythonExec, +sc.pythonVer, +
Re: [PR] [SPARK-45487] Fix SQLSTATEs and temp errors [spark]
gengliangwang closed pull request #43342: [SPARK-45487] Fix SQLSTATEs and temp errors URL: https://github.com/apache/spark/pull/43342 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45487] Fix SQLSTATEs and temp errors [spark]
gengliangwang commented on PR #43342: URL: https://github.com/apache/spark/pull/43342#issuecomment-1762196108 Thanks, merging to master -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45507][SQL] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]
andylam-db commented on code in PR #43341: URL: https://github.com/apache/spark/pull/43341#discussion_r1357456302 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala: ## @@ -360,8 +360,40 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) { case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld) if children.nonEmpty => -val (newPlan, newCond) = decorrelate(sub, plan) -val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) { + +def mayHaveCountBugAgg(a: Aggregate): Boolean = { + a.groupingExpressions.isEmpty && a.aggregateExpressions.exists(_.exists { +case a: AggregateExpression => a.aggregateFunction.defaultResult.isDefined +case _ => false + }) +} + +// The below logic controls handling count bug for scalar subqueries in +// [[DecorrelateInnerQuery]], and if we don't handle it here, we handle it in +// [[RewriteCorrelatedScalarSubquery#constructLeftJoins]]. Note that handling it in +// [[DecorrelateInnerQuery]] is always correct, and turning it off to handle it in +// constructLeftJoins is an optimization, so that additional, redundant left outer joins are +// not introduced. +val handleCountBugInDecorrelate = !conf.getConf( + SQLConf.LEGACY_SCALAR_SUBQUERY_COUNT_BUG_HANDLING) && !(sub match { + // Handle count bug only if there exists lower level Aggs with count bugs. It does not + // matter if the top level agg is count bug vulnerable or not, because: + // 1. If the top level agg is count bug vulnerable, it can be handled in + // constructLeftJoins, unless there are lower aggs that are count bug vulnerable. + // E.g. COUNT(COUNT + COUNT) + // 2. If the top level agg is not count bug vulnerable, it can be count bug vulnerable if + // there are lower aggs that are count bug vulnerable. E.g. SUM(COUNT) + case agg: Aggregate => !agg.child.exists { +case lowerAgg: Aggregate => mayHaveCountBugAgg(lowerAgg) +case _ => false + } + case _ => false +}) +val (newPlan, newCond) = decorrelate(sub, plan, handleCountBugInDecorrelate) +val mayHaveCountBug = if (handleCountBugInDecorrelate) { + // Count bug was already handled in the above decorrelate function call. + false +} else if (mayHaveCountBugOld.isEmpty) { Review Comment: If the plan has already been decorrelated and count bugs handled in `decorrelate()` above, then we can't set `mayHaveCountBug` parameter to be true. This is because if we set it as true based on the pre-rewrite plan (which is what's happening now), running `splitSubquery()` in `constructLeftJoins` will fail because it encounters left outer joins (invalid operators). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45524][PYTHON][SQL] Initial support for Python data source read API [spark]
dtenedor commented on code in PR #43360: URL: https://github.com/apache/spark/pull/43360#discussion_r1358666425 ## python/pyspark/sql/datasource.py: ## @@ -0,0 +1,168 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +from abc import ABC, abstractmethod +from typing import Any, Dict, Iterator, Tuple, Union, TYPE_CHECKING + +from pyspark.sql import Row +from pyspark.sql.types import StructType + +if TYPE_CHECKING: +from pyspark.sql._typing import OptionalPrimitiveType + + +__all__ = ["DataSource", "DataSourceReader"] + + +class DataSource: +def __init__(self, options: Dict[str, "OptionalPrimitiveType"]): +""" +Initializes the data source with user-provided options. + +Parameters +-- +options : dict +A dictionary representing the options for this data source. + +Notes +- +This method should not contain any non-serializable objects. +""" +self.options = options + +@property +def name(self) -> str: +""" +Returns a string represents the short name of this data source. +""" +return self.__class__.__name__ + +def schema(self) -> Union[StructType, str]: +""" +Returns the schema of the data source. + +It can reference the `options` field to infer the data source's schema when +users do not explicitly specify it. This method is invoked once when calling +`spark.read.format(...).load()` to get the schema for a data source read operation. +If this method is not implemented, and a user does not provide a schema when +reading the data source, an exception will be thrown. + +Returns +--- +schema : StructType or str +The schema of this data source or a DDL string represents the schema + +Examples + +Returns a DDL string: + +>>> def schema(self): +>>>return "a INT, b STRING" + +Returns a StructType: + +>>> def schema(self): +>>> return StructType().add("a", "int").add("b", "string") +""" +... + +def reader(self, schema: StructType) -> "DataSourceReader": +""" +Returns a DataSourceReader instance for reading data. + +This method is required for readable data sources. It will be called once during Review Comment: If we want to create a user-defined Python materialization in the future that has no corresponding reader implementation, this API would preclude this. Should we instead change the invariant to "at least one of 'reader' and/or 'writer' must be implemented"? ## python/pyspark/sql/datasource.py: ## @@ -0,0 +1,168 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +from abc import ABC, abstractmethod +from typing import Any, Dict, Iterator, Tuple, Union, TYPE_CHECKING + +from pyspark.sql import Row +from pyspark.sql.types import StructType + +if TYPE_CHECKING: +from pyspark.sql._typing import OptionalPrimitiveType + + +__all__ = ["DataSource", "DataSourceReader"] + + +class DataSource: Review Comment: ```suggestion class DataSource: """ This represents a user-defined table implemented in Python. Subsequent Spark programs can then query from and/or write to the table. This class defines the table's schema and declares an interface to define logic to generate the rows for a scan
Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]
tgravescs commented on PR #43366: URL: https://github.com/apache/spark/pull/43366#issuecomment-1762123200 thanks, overall issue makes sense, looking in more detail, might be monday before I get time -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]
mridulm commented on PR #43366: URL: https://github.com/apache/spark/pull/43366#issuecomment-1762042548 +CC @tgravescs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45429][CORE] Add helper classes for SSL RPC communication [spark]
mridulm commented on code in PR #43244: URL: https://github.com/apache/spark/pull/43244#discussion_r1358700597 ## common/network-common/src/main/java/org/apache/spark/network/protocol/EncryptedMessageWithHeader.java: ## @@ -0,0 +1,148 @@ + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.network.protocol; + +import javax.annotation.Nullable; +import java.io.EOFException; +import java.io.InputStream; Review Comment: nit: Did not notice this earlier - reorder imports, and make sure `javax` comes after `java` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
andylam-db commented on PR #42725: URL: https://github.com/apache/spark/pull/42725#issuecomment-1761961814 @cloud-fan I think the build is failing because of unrelated failing tests -- timeouts and what not. Could you take a look and see if we can merge this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ [spark]
JoshRosen commented on PR #43344: URL: https://github.com/apache/spark/pull/43344#issuecomment-1761931294 > I backported to branch-3.5/3.4/3.5 manually updating from `Assertions.` to `Assert.`. Thanks @dongjoon-hyun! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45516][CONNECT] Include QueryContext in SparkThrowable proto message [spark]
heyihong commented on code in PR #43352: URL: https://github.com/apache/spark/pull/43352#discussion_r1358592923 ## python/pyspark/sql/connect/proto/base_pb2.pyi: ## @@ -2869,6 +2869,59 @@ class FetchErrorDetailsResponse(google.protobuf.message.Message): self, oneof_group: typing_extensions.Literal["_file_name", b"_file_name"] ) -> typing_extensions.Literal["file_name"] | None: ... +class QueryContext(google.protobuf.message.Message): Review Comment: Not yet if I understand correctly. @gatorsmile has some concern on whether exposing QueryContext in the PySpark Exception APIs makes sense for non-sql PySpark exceptions. There is some ongoing discussion for this. ## python/pyspark/sql/connect/proto/base_pb2.pyi: ## @@ -2869,6 +2869,59 @@ class FetchErrorDetailsResponse(google.protobuf.message.Message): self, oneof_group: typing_extensions.Literal["_file_name", b"_file_name"] ) -> typing_extensions.Literal["file_name"] | None: ... +class QueryContext(google.protobuf.message.Message): Review Comment: Not yet if I understand correctly. @gatorsmile has some concern about whether exposing QueryContext in the PySpark Exception APIs makes sense for non-sql PySpark exceptions. There is some ongoing discussion for this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45517][CONNECT] Expand more exception constructors to support error framework parameters [spark]
heyihong commented on PR #43368: URL: https://github.com/apache/spark/pull/43368#issuecomment-1761859899 @hvanhovell @juliuszsompolski @MaxGekk @HyukjinKwon Please take a look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [SPARK-45517][CONNECT] Expand more exception constructors to support error framework parameters [spark]
heyihong opened a new pull request, #43368: URL: https://github.com/apache/spark/pull/43368 ### What changes were proposed in this pull request? - Expand more exception constructors to support error framework parameters ### Why are the changes needed? - Better integration with the error framework ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? `build/sbt "connect-client-jvm/testOnly *SparkConnectClientSuite"` ### Was this patch authored or co-authored using generative AI tooling? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ [spark]
dongjoon-hyun commented on PR #43344: URL: https://github.com/apache/spark/pull/43344#issuecomment-1761819209 I backported to branch-3.5/3.4/3.5 manually updating from `Assertions.` to `Assert.`. ``` $ build/sbt "unsafe/testOnly *.PlatformUtilSuite" ... [info] Test run org.apache.spark.unsafe.PlatformUtilSuite started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.freeingOnHeapMemoryBlockResetsBaseObjectAndOffset started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.overlappingCopyMemory started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.memoryDebugFillEnabledInTest started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.offHeapMemoryAllocatorThrowsAssertionErrorOnDoubleFree started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.heapMemoryReuse started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.onHeapMemoryAllocatorPoolingReUsesLongArrays started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.onHeapMemoryAllocatorThrowsAssertionErrorOnDoubleFree started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.freeingOffHeapMemoryBlockResetsOffset started [info] Test org.apache.spark.unsafe.PlatformUtilSuite.cleanerCreateMethodIsDefined started [info] Test run org.apache.spark.unsafe.PlatformUtilSuite finished: 0 failed, 0 ignored, 9 total, 0.022s [info] Passed: Total 9, Failed 0, Errors 0, Passed 9 [success] Total time: 7 s, completed Oct 13, 2023, 9:52:18 AM ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45426][CORE] Add support for a ReloadingX509TrustManager [spark]
mridulm closed pull request #43249: [SPARK-45426][CORE] Add support for a ReloadingX509TrustManager URL: https://github.com/apache/spark/pull/43249 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]
dongjoon-hyun commented on PR #43364: URL: https://github.com/apache/spark/pull/43364#issuecomment-1761811406 Thank you, @LuciferYang and all. Since Java 17 JVM GC is different than the old ParallelGC, we can optimize further. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45426][CORE] Add support for a ReloadingX509TrustManager [spark]
mridulm commented on PR #43249: URL: https://github.com/apache/spark/pull/43249#issuecomment-1761811698 The test failure is unrelated. Merged to master. Thanks for working on this @hasnain-db ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]
sandip-db commented on code in PR #43350: URL: https://github.com/apache/spark/pull/43350#discussion_r1358449841 ## examples/src/main/resources/people.xml: ## @@ -0,0 +1,15 @@ + + + +Michael +29 + + +Andy +30 + + +Justin +19 + + Review Comment: ```suggestion ``` ## docs/sql-data-sources-xml.md: ## @@ -0,0 +1,224 @@ +--- +layout: global +title: XML Files +displayTitle: XML Files +license: | +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +--- + +Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame, +and `dataframe.write().xml("path")` to write to a xml file. +When reading a XML file, the `rowTag` option need to be specified to indicate the XML element that maps to a `DataFrame row`. The option() function +can be used to customize the behavior of reading or writing, such as controlling behavior of the XML attributes, XSD validation, compression, and so +on. + + + + +{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %} + + + +{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %} + + + + +## Data Source Option + +Data source options of XML can be set via: + +* the `.option`/`.options` methods of +* `DataFrameReader` +* `DataFrameWriter` +* `DataStreamReader` +* `DataStreamWriter` +* the built-in functions below +* `from_xml` +* `to_xml` +* `schema_of_xml` +* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html) + + + Property NameDefaultMeaningScope + +rowTag +ROW +The row tag of your xml files to treat as a row. For example, in this xml: ... the appropriate value would be book. +read + + + +samplingRatio +1.0 +Defines fraction of rows used for schema inferring. XML built-in functions ignore this option. +read + + + +excludeAttribute +false +Whether to exclude attributes in elements. +read + + + +mode +PERMISSIVE +Allows a mode for dealing with corrupt records during parsing. + + PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema. + DROPMALFORMED: ignores the whole corrupted records. This mode is unsupported in the JSON built-in functions. + FAILFAST: throws an exception when it meets corrupted records. + + +read + + + + inferSchema + true + If true, attempts to infer an appropriate type for each resulting DataFrame column. If false, all resulting columns are of string type. Default is true. XML built-in functions ignore this option. + read + + + + columnNameOfCorruptRecord + spark.sql.columnNameOfCorruptRecord + Allows renaming the new field having a malformed string created by PERMISSIVE mode. + read + + + +attributePrefix +_ +The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is _. Can be empty for reading XML, but not for writing. +read/write + + + +valueTag +_VALUE +The tag used for the value when there are attributes in the element having no child. +read/write + + + +encoding +UTF-8 +For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. +read/write + + + +ignoreSurroundingSpaces +false +Defines whether surrounding whitespaces from values being read should be skipped. +read + + + + rowValidationXSDPath + null + Path to an optional XSD file that is used to validate the XML
Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]
asfgit closed pull request #43323: [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled URL: https://github.com/apache/spark/pull/43323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]
tgravescs commented on PR #43323: URL: https://github.com/apache/spark/pull/43323#issuecomment-1761738892 merged to master and branch-3.5. Thanks @wbo4958 @mridulm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]
Yikun commented on code in PR #56: URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358357340 ## 3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile: ## @@ -20,7 +20,10 @@ USER root RUN set -ex; \ apt-get update; \ -apt-get install -y python3 python3-pip; \ +apt install -y software-properties-common; \ +add-apt-repository ppa:deadsnakes/ppa; \ +apt install python3.10; \ Review Comment: Is there any special reason why we use the python 3.10? I prefer to use os default python3 version from matainence cost view, and also os default python version has more stable quality and security. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]
Yikun commented on code in PR #56: URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358357340 ## 3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile: ## @@ -20,7 +20,10 @@ USER root RUN set -ex; \ apt-get update; \ -apt-get install -y python3 python3-pip; \ +apt install -y software-properties-common; \ +add-apt-repository ppa:deadsnakes/ppa; \ +apt install python3.10; \ Review Comment: Is there any special reason why we use the python 3.10? I prefer to use os default python3 version from matainence cost view, and also os default python version has more stable quality. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]
LuciferYang commented on PR #43364: URL: https://github.com/apache/spark/pull/43364#issuecomment-1761659955 Merge into master to observe the `Publish Snapshot` job, if it doesn't work, we can revert it tomorrow. Thanks @HyukjinKwon @beliefer @EnricoMi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]
LuciferYang closed pull request #43364: [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g URL: https://github.com/apache/spark/pull/43364 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]
LuciferYang commented on PR #43364: URL: https://github.com/apache/spark/pull/43364#issuecomment-1761654980 ``` [info] *** 1 TEST FAILED *** [error] Failed tests: [error] org.apache.spark.sql.kafka010.KafkaSourceStressSuite [error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 2132 s (35:32), completed Oct 13, 2023, 12:07:56 PM ``` Only `KafkaSourceStressSuite` test failed, this is a known flaky test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]
Yikun commented on code in PR #52: URL: https://github.com/apache/spark-docker/pull/52#discussion_r1358346064 ## 3.4.1/scala2.12-java11-ubuntu/entrypoint.sh: ## @@ -77,6 +77,9 @@ elif ! [ -z "${SPARK_HOME+x}" ]; then SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH"; fi +# SPARK-43540: add current working directory into executor classpath Review Comment: cc @HyukjinKwon @zhengruifeng @dongjoon-hyun Or we could just apply master dockerfile changes in 3.4 images? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-40513] Switch 3.4.0 default Java to Java17 [spark-docker]
Yikun commented on PR #35: URL: https://github.com/apache/spark-docker/pull/35#issuecomment-1761638285 Replaced by https://github.com/apache/spark-docker/pull/56 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-40513] Switch 3.4.0 default Java to Java17 [spark-docker]
Yikun closed pull request #35: [SPARK-40513] Switch 3.4.0 default Java to Java17 URL: https://github.com/apache/spark-docker/pull/35 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]
Yikun commented on code in PR #56: URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358361784 ## tools/template.py: ## @@ -59,7 +59,7 @@ def parse_opts(): parser.add_argument( "-j", "--java-version", -help="The Spark version of Dockerfile.", +help="Java version of Dockerfile.", Review Comment: Good catch -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]
Yikun commented on code in PR #56: URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358361340 ## .github/workflows/test.yml: ## @@ -37,12 +37,15 @@ on: - 3.3.0 java: description: 'The Java version of Spark image.' -default: 11 +default: "11" Review Comment: Is it neccessary? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]
Yikun commented on code in PR #56: URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358359041 ## add-dockerfiles.sh: ## @@ -44,12 +48,20 @@ for TAG in $TAGS; do if echo $TAG | grep -q "r-"; then OPTS+=" --sparkr" fi + +if echo $TAG | grep -q "java17"; then +OPTS+=" --java-version 17 --image eclipse-temurin:17-jre-jammy" +fi +if echo $TAG | grep -q "java11"; then Review Comment: elif? ## add-dockerfiles.sh: ## @@ -44,12 +48,20 @@ for TAG in $TAGS; do if echo $TAG | grep -q "r-"; then OPTS+=" --sparkr" fi + +if echo $TAG | grep -q "java17"; then +OPTS+=" --java-version 17 --image eclipse-temurin:17-jre-jammy" Review Comment: Greate! ## 3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile: ## @@ -20,7 +20,10 @@ USER root RUN set -ex; \ apt-get update; \ -apt-get install -y python3 python3-pip; \ +apt install -y software-properties-common; \ +add-apt-repository ppa:deadsnakes/ppa; \ +apt install python3.10; \ Review Comment: Is there any special reason why we use the python 3.10? I prefer to use os default python3 version from matainence cost view. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]
Yikun commented on code in PR #52: URL: https://github.com/apache/spark-docker/pull/52#discussion_r1358343497 ## 3.4.1/scala2.12-java11-ubuntu/entrypoint.sh: ## @@ -77,6 +77,9 @@ elif ! [ -z "${SPARK_HOME+x}" ]; then SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH"; fi +# SPARK-43540: add current working directory into executor classpath Review Comment: This is 3.5.x series, we might create a `entrypoint.sh.3.4.template`, `Dockerfile.3.4.template`. We also [change](https://github.com/apache/spark-docker/blob/master/add-dockerfiles.sh#L53) the `add-dockerfiles.sh`, For 3.x.1 version, if `3.x` templates exists, use 3.x template, otherwise use the master entrypoint and Dockerfile directly. It can be a separate PR. ## versions.json: ## @@ -40,18 +40,41 @@ "3.4.1" ] }, +{ + "path": "3.4.1/scala2.13-java11-python3-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-python3-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-r-ubuntu", "tags": [ "3.4.1-scala2.12-java11-r-ubuntu", "3.4.1-r" ] }, +{ + "path": "3.4.1/scala2.13-java11-r-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-r-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-ubuntu", "tags": [ "3.4.1-scala2.12-java11-ubuntu", -"3.4.1-scala" +"3.4.1-scala2.12", +"scala2.12", +"3.4.1-scala", +"scala" Review Comment: scala should be removed, because it's 3.5.0 now ## 3.4.1/scala2.12-java11-ubuntu/entrypoint.sh: ## @@ -77,6 +77,9 @@ elif ! [ -z "${SPARK_HOME+x}" ]; then SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH"; fi +# SPARK-43540: add current working directory into executor classpath Review Comment: cc @HyukjinKwon @zhengruifeng @dongjoon-hyun Or we could just apply master dockerfile changes in 3.4-branches? ## .github/workflows/build_3.4.1.yaml: ## @@ -24,18 +24,18 @@ on: branches: - 'master' paths: - - '3.4.1/**' + - '3.4.1/scala2.**' jobs: run-build: strategy: matrix: image-type: ["all", "python", "scala", "r"] +scala: ["2.13","2.12"] name: Run secrets: inherit uses: ./.github/workflows/main.yml with: spark: 3.4.1 - scala: 2.12 Review Comment: This shouldn't be removed, so CI is break (not triggered). Try ```bash scala: ${{ matrix.scala }} ``` [1] https://github.com/apache/spark-docker/actions/runs/6398139681/workflow ## 3.4.1/scala2.12-java11-ubuntu/entrypoint.sh: ## @@ -90,6 +93,7 @@ case "$1" in CMD=( "$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" + --conf "spark.executorEnv.SPARK_DRIVER_POD_IP=$SPARK_DRIVER_BIND_ADDRESS" Review Comment: ditto ## versions.json: ## @@ -40,18 +40,41 @@ "3.4.1" ] }, +{ + "path": "3.4.1/scala2.13-java11-python3-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-python3-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-r-ubuntu", "tags": [ "3.4.1-scala2.12-java11-r-ubuntu", "3.4.1-r" ] }, +{ + "path": "3.4.1/scala2.13-java11-r-ubuntu", + "tags": [ +"3.4.1-scala2.13-java11-r-ubuntu" + ] +}, { "path": "3.4.1/scala2.12-java11-ubuntu", "tags": [ "3.4.1-scala2.12-java11-ubuntu", -"3.4.1-scala" +"3.4.1-scala2.12", +"scala2.12", Review Comment: scala2.12 should be removed, because it's 3.5.0 now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[PR] [WIP] Add support for java 17 and explicit Python versions from 3.5.0 [spark-docker]
vakarisbk opened a new pull request, #56: URL: https://github.com/apache/spark-docker/pull/56 ### What changes were proposed in this pull request? 1. Create Java17 base images alongside Java11 images starting from spark 3.5.0 2. Add the ability to explicitly define Python versions 3. Change ubuntu version to 22.04 for `scala2.12-java17-*` ### Why are the changes needed? Spark supports multiple versions of Java and Spark and some community members have a need to use specific versions of Java and Python for their use cases. Adding this option would simplify workflows for these users and make Spark more accessible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45468][UI] More transparent proxy handling for HTTP redirects [spark]
nsuke commented on PR #43297: URL: https://github.com/apache/spark/pull/43297#issuecomment-1761482605 Hi @yaooqinn @dongjoon-hyun @srowen could you review this? This makes Web UI proxy setup simpler. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
cloud-fan commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358221878 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -114,22 +115,19 @@ object JdbcUtils extends Logging with SQLConfHelper { isCaseSensitive: Boolean, dialect: JdbcDialect): String = { Review Comment: note: `getInsertStatement` is not a public API but we don't stop people from calling `JdbcUtils` as there is no `private[spark]`. I'm fine keeping it to avoid troubles for third-party Spark vendors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-45531][SQL][DOCS] Add more comments and rename some variable name for InjectRuntimeFilter [spark]
cloud-fan commented on PR #43359: URL: https://github.com/apache/spark/pull/43359#issuecomment-1761466792 what variables did you rename? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]
beliefer commented on PR #43350: URL: https://github.com/apache/spark/pull/43350#issuecomment-1761452245 > Thank you very much for your help Please refer https://github.com/apache/spark/blob/master/docs/README.md -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
Hisoka-X commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358194353 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -104,34 +105,6 @@ object JdbcUtils extends Logging with SQLConfHelper { JdbcDialects.get(url).isCascadingTruncateTable() } - /** - * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn. - */ - def getInsertStatement( Review Comment: addressed all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
beliefer commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358189908 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -670,7 +643,6 @@ object JdbcUtils extends Logging with SQLConfHelper { * updated even with error if it doesn't support transaction, as there're dirty outputs. */ def savePartition( - table: String, Review Comment: Yes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
Hisoka-X commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358187351 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -670,7 +643,6 @@ object JdbcUtils extends Logging with SQLConfHelper { * updated even with error if it doesn't support transaction, as there're dirty outputs. */ def savePartition( - table: String, Review Comment: Should I revert this as well? @beliefer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [WIP][SPARK-45531][SQL][DOCS] Add more comments and rename some variable name for InjectRuntimeFilter [spark]
beliefer commented on PR #43359: URL: https://github.com/apache/spark/pull/43359#issuecomment-1761418972 The failure is unrelated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
beliefer commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358154414 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -104,34 +105,6 @@ object JdbcUtils extends Logging with SQLConfHelper { JdbcDialects.get(url).isCascadingTruncateTable() } - /** - * Returns an Insert SQL statement for inserting a row into the target table via JDBC conn. - */ - def getInsertStatement( Review Comment: This is a public API too. Shall we refactor `getInsertStatement` by use `dialect.insertIntoTable` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
Hisoka-X commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358145882 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper { def saveTable( df: DataFrame, tableSchema: Option[StructType], - isCaseSensitive: Boolean, Review Comment: done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
beliefer commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358121363 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper { def saveTable( df: DataFrame, tableSchema: Option[StructType], - isCaseSensitive: Boolean, Review Comment: This is a public API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45486][CONNECT] Make add_artifact request idempotent [spark]
cdkrot commented on code in PR #43314: URL: https://github.com/apache/spark/pull/43314#discussion_r1358099191 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala: ## @@ -82,6 +83,40 @@ class SparkConnectArtifactManager(sessionHolder: SessionHolder) extends Logging */ def getSparkConnectPythonIncludes: Seq[String] = pythonIncludeList.asScala.toSeq + private def areFilesEqual(path1: Path, path2: Path): Boolean = { Review Comment: Added this, thanks for the feedback! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-45486][CONNECT] Make add_artifact request idempotent [spark]
cdkrot commented on code in PR #43314: URL: https://github.com/apache/spark/pull/43314#discussion_r1358090333 ## connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala: ## @@ -82,6 +83,40 @@ class SparkConnectArtifactManager(sessionHolder: SessionHolder) extends Logging */ def getSparkConnectPythonIncludes: Seq[String] = pythonIncludeList.asScala.toSeq + private def areFilesEqual(path1: Path, path2: Path): Boolean = { Review Comment: That sounds good, let me see if we can use it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
Hisoka-X commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358087806 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper { def saveTable( df: DataFrame, tableSchema: Option[StructType], - isCaseSensitive: Boolean, Review Comment: Is there any special reason? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]
beliefer commented on code in PR #41855: URL: https://github.com/apache/spark/pull/41855#discussion_r1358085152 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala: ## @@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper { def saveTable( df: DataFrame, tableSchema: Option[StructType], - isCaseSensitive: Boolean, Review Comment: Please not change the signature. We still not use it is OK. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org