date:20231013

Re: [PR] [SPARK-42669][CONNECT] Short circuit local relation RPCs [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #40782:
URL: https://github.com/apache/spark/pull/40782#issuecomment-1762631600

   @hvanhovell Do we need this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-42669][CONNECT] Short circuit local relation RPCs [spark]

2023-10-13 Thread via GitHub



beliefer opened a new pull request, #40782:
URL: https://github.com/apache/spark/pull/40782

   ### What changes were proposed in this pull request?
   Operations on `LocalRelation` can mostly be done locally (without sending 
RPCs).
   We should leverage this.
   
   
   ### Why are the changes needed?
   Avoid sending RPCs for `LocalRelation`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature.
   
   
   ### How was this patch tested?
   Exists test cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-42746][SQL] Add the LISTAGG() aggregate function [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #42398:
URL: https://github.com/apache/spark/pull/42398#discussion_r1359157402


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ListAgg.scala:
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.trees.UnaryLike
+import org.apache.spark.sql.catalyst.types.PhysicalDataType
+import org.apache.spark.sql.types.{DataType, StringType}
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.collection.OpenHashMap
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns the concatenated input values," +
+" separated by the delimiter string.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(col) FROM VALUES ('a'), ('b'), ('c') AS tab(col);
+   a,b,c
+  > SELECT _FUNC_(col) FROM VALUES (NULL), ('a'), ('b') AS tab(col);
+   a,b
+  > SELECT _FUNC_(col) FROM VALUES ('a'), ('a') AS tab(col);
+   a,a
+  > SELECT _FUNC_(DISTINCT col) FROM VALUES ('a'), ('a'), ('b') AS 
tab(col);
+   a,b
+  > SELECT _FUNC_(col, '|') FROM VALUES ('a'), ('b') AS tab(col);
+   a|b
+  > SELECT _FUNC_(col) FROM VALUES (NULL), (NULL) AS tab(col);
+   NULL
+  """,
+  group = "agg_funcs",
+  since = "4.0.0")
+case class ListAgg(
+child: Expression,
+delimiter: Expression = Literal.create(",", StringType),
+reverse: Boolean = false,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0) extends TypedAggregateWithHashMapAsBuffer
+  with UnaryLike[Expression] {
+
+  def this(child: Expression) = this(child, Literal.create(",", StringType), 
false, 0, 0)
+  def this(child: Expression, delimiter: Expression) = this(child, delimiter, 
false, 0, 0)
+
+  override def update(
+  buffer: OpenHashMap[AnyRef, Long],

Review Comment:
   You can composite them. Please refer `PercentileCont` or `RegrSlope`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44837][SQL] Improve ALTER TABLE ALTER PARTITION column error message [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #42524:
URL: https://github.com/apache/spark/pull/42524#discussion_r1359145115


##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -377,8 +377,13 @@ case class AlterTableChangeColumnCommand(
 val resolver = sparkSession.sessionState.conf.resolver
 DDLUtils.verifyAlterTableType(catalog, table, isView = false)
 
-// Find the origin column from dataSchema by column name.
+// check that the column is not a partition column

Review Comment:
   `check` -> `Check`



##
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala:
##
@@ -377,8 +377,13 @@ case class AlterTableChangeColumnCommand(
 val resolver = sparkSession.sessionState.conf.resolver
 DDLUtils.verifyAlterTableType(catalog, table, isView = false)
 
-// Find the origin column from dataSchema by column name.
+// check that the column is not a partition column
+if (findColumnByName(table.partitionSchema, columnName, 
resolver).isDefined) {

Review Comment:
   How about `table.partitionSchema.fieldNames.exists(resolver(_, columnName))` 
?
   
   So we no need change `findColumnByName` and related code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44649][SQL] Runtime Filter supports passing equivalent creation side expressions [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #42317:
URL: https://github.com/apache/spark/pull/42317#issuecomment-1762596244

   The GA failure is unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45531][SQL][DOCS] Add more comments and rename some variable name for InjectRuntimeFilter [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43359:
URL: https://github.com/apache/spark/pull/43359#issuecomment-1762593379

   > what variables did you rename?
   
   Done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45530][CORE] Use `java.lang.ref.Cleaner` instead of `finalize` for `NioBufferedFileInputStream` [spark]

2023-10-13 Thread via GitHub



mridulm commented on PR #4:
URL: https://github.com/apache/spark/pull/4#issuecomment-1762592047

   Looks good to me, thanks for fixing this @LuciferYang !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44307][SQL] Add Bloom filter for left outer join even if the left side table is smaller than broadcast threshold. [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #41860:
URL: https://github.com/apache/spark/pull/41860#discussion_r1359115220


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InjectRuntimeFilter.scala:
##
@@ -322,9 +332,6 @@ object InjectRuntimeFilter extends Rule[LogicalPlan] with 
PredicateHelper with J
 isSimpleExpression(l) && isSimpleExpression(r)) {
 val oldLeft = newLeft
 val oldRight = newRight
-// Check if the current join is a shuffle join or a broadcast join 
that
-// has a shuffle below it
-val hasShuffle = isProbablyShuffleJoin(left, right, hint)

Review Comment:
   Shall we not move this line?
   Although move it to line 323 is also correct. We should do it lazily.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [MINOR][CORE] Add `@Deprecated` for `SparkLauncher#DEPRECATED_CHILD_CONNECTION_TIMEOUT` [spark]

2023-10-13 Thread via GitHub



LuciferYang opened a new pull request, #43374:
URL: https://github.com/apache/spark/pull/43374

   ### What changes were proposed in this pull request?
   This pr just add `@Deprecated` for 
`SparkLauncher#DEPRECATED_CHILD_CONNECTION_TIMEOUT`.
   
   
   ### Why are the changes needed?
   From the javadoc, `DEPRECATED_CHILD_CONNECTION_TIMEOUT` has been deprecated, 
so it should carry a `@Deprecated` marker.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #43371:
URL: https://github.com/apache/spark/pull/43371#discussion_r1359110742


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -1933,6 +1933,39 @@ public RoaringBitmap getMapTracker() {
 int getNumIOExceptions() {
   return numIOExceptions;
 }
+
+private record ResourceCleaner(
+File dataFile,
+FileChannel dataChannel,
+MergeShuffleFile indexFile,
+MergeShuffleFile metaFile,
+AppAttemptShuffleMergeId appAttemptShuffleMergeId,
+int reduceId) implements Runnable {
+
+  @Override
+  public void run() {
+try {

Review Comment:
   +1 reuse `AppShufflePartitionInfo#closeAllFilesAndDeleteIfNeeded`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43330:
URL: https://github.com/apache/spark/pull/43330#issuecomment-1762547574

   cc @dongjoon-hyun @wangyum 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43330:
URL: https://github.com/apache/spark/pull/43330#issuecomment-1762547364

   The GA failure is unrelated to this PR.
   `Linters, licenses, dependencies and documentation generation`
   ```
   /usr/lib/ruby/2.7.0/fileutils.rb:1415:in `copy_stream': No space left on 
device - copy_file_range (Errno::ENOSPC)
   
[12872](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12872)
from /usr/lib/ruby/2.7.0/fileutils.rb:1415:in `block (2 levels) in 
copy_file'
   
[12873](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12873)
from /usr/lib/ruby/2.7.0/fileutils.rb:1414:in `open'
   
[12874](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12874)
from /usr/lib/ruby/2.7.0/fileutils.rb:1414:in `block in copy_file'
   
[12875](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12875)
from /usr/lib/ruby/2.7.0/fileutils.rb:1413:in `open'
   
[12876](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12876)
from /usr/lib/ruby/2.7.0/fileutils.rb:1413:in `copy_file'
   
[12877](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12877)
from /usr/lib/ruby/2.7.0/fileutils.rb:1378:in `copy'
   
[12879](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12879)
   Error: No space left on device : 
'/home/runner/runners/2.309.0/_diag/pages/b7db6e67-b9c1-4011-a807-e832ba0cf437_fef911b0-771d-54c9-1f3c-e23e2c04b8fb_1.log'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45427][CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf [spark]

2023-10-13 Thread via GitHub



hasnain-db commented on PR #43238:
URL: https://github.com/apache/spark/pull/43238#issuecomment-1762546624

   `org.apache.spark.sql.kafka010.KafkaSourceStressSuite` is failing on this 
and the other outstanding PR, and seems to be unrelated to the changes here. I 
wonder if that is related to the new kafka version bump since it looks like it 
also failed on 
https://github.com/apache/spark/actions/runs/6497942586/job/17648192318


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45429][CORE] Add helper classes for SSL RPC communication [spark]

2023-10-13 Thread via GitHub



hasnain-db commented on PR #43244:
URL: https://github.com/apache/spark/pull/43244#issuecomment-1762546614

   `org.apache.spark.sql.kafka010.KafkaSourceStressSuite` is failing on this 
and the other outstanding PR, and seems to be unrelated to the changes here. I 
wonder if that is related to the new kafka version bump since it looks like it 
also failed on 
https://github.com/apache/spark/actions/runs/6497942586/job/17648192318


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43351:
URL: https://github.com/apache/spark/pull/43351#issuecomment-1762545714

   @LuciferYang @srowen Thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45514][SQL][MLLIB] Replace `scala.runtime.Tuple3Zipped` to `scala.collection.LazyZip3` [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43363:
URL: https://github.com/apache/spark/pull/43363#issuecomment-1762545286

   The GA failure is unrelated to this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45514][SQL][MLLIB] Replace `scala.runtime.Tuple3Zipped` to `scala.collection.LazyZip3` [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43363:
URL: https://github.com/apache/spark/pull/43363#issuecomment-1762545070

   cc @HyukjinKwon @srowen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` [spark]

2023-10-13 Thread via GitHub



beliefer closed pull request #43351: [SPARK-45513][CORE][SQL][MLLIB][CONNECT] 
Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2`
URL: https://github.com/apache/spark/pull/43351


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44527][SQL] Replace ScalarSubquery with null if its maxRows is 0 [spark]

2023-10-13 Thread via GitHub



wangyum commented on code in PR #42129:
URL: https://github.com/apache/spark/pull/42129#discussion_r1359102724


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala:
##
@@ -88,6 +88,10 @@ object ConstantFolding extends Rule[LogicalPlan] {
   e
   }
 
+// Replace ScalarSubquery with null if its maxRows is 0
+case s: ScalarSubquery if s.plan.maxRows.contains(0) =>

Review Comment:
   @jchen5 Do we need to consider `mayHaveCountBug`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44210][CONNECT][SQL][PYTHON] Strengthen type checking and better comply with Connect specifications for `levenshtein` function [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #41724:
URL: https://github.com/apache/spark/pull/41724#issuecomment-1762543078

   Don't we need this PR？


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45530][CORE] Use `java.lang.ref.Cleaner` instead of `finalize` for `NioBufferedFileInputStream` [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on code in PR #4:
URL: https://github.com/apache/spark/pull/4#discussion_r1359102252


##
core/src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:
##
@@ -32,8 +34,11 @@
  */
 public final class NioBufferedFileInputStream extends InputStream {
 
+  private static final Cleaner CLEANER = Cleaner.create();

Review Comment:
   
[a9f3025](https://github.com/apache/spark/pull/4/commits/a9f3025afb2381f573ee0c3e146da9d0b704593b)
 rename this from `cleaner` to `CLEANER` due to it's static



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-44662] Perf improvement in BroadcastHashJoin queries with stream side join key on non partition columns [spark]

2023-10-13 Thread via GitHub



ahshahid opened a new pull request, #43373:
URL: https://github.com/apache/spark/pull/43373

   What changes were proposed in this pull request?
   On the lines of DPP which helps DataSourceV2 relations when the joining key 
is a partition column, the same concept can be extended over to the case where 
joining key is not a partition column.
   In this PR, the keys available in the BroadcastHashJoinExec are pushed down 
to the DataSourceV2 scans in form of a SortedSet structure.
   For non partition columns, the DataSources like iceberg have max/min stats 
on columns available at manifest level, and for formats like parquet , they 
have max/min stats at various data structure levels. The passed SortedSet can 
be used to prune using ranges at both driver level ( manifests files) as well 
as executor level ( while actually going through chunks , row groups etc at 
parquet level)
   
   If the data is stored as Columnar Batch format , then it would not be 
possible to filter out individual row at DataSource level, even though we have 
keys.
   But at the scan level, ( ColumnToRowExec) it is still possible to filter out 
as many rows as possible , if the query involves nested joins. Thus reducing 
the number of rows to join at the higher join levels.
   
   Attaching link to a presentation which outlines the idea:
   [Broadcast Keys 
pushdown](https://docs.google.com/presentation/d/165Rx7i00TmAKnDJpSQLfrcrW-ShrzPy5/edit?usp=drive_link)
   
   SPIP : [SPIP-44662](https://issues.apache.org/jira/browse/SPARK-44662)
   
   Why are the changes needed?
   There is scope of improvement in the performance of Inner and Left Semi join 
queries when using BroadcastHashJoin
   
   Does this PR introduce any user-facing change?
   No
   
   How was this patch tested?
   Ran TPCDS suite using iceberg as DataSource.
   Converted many of the existing Spark Query tests to also run using iceberg 
as data source.
   Will be adding more unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on code in PR #43371:
URL: https://github.com/apache/spark/pull/43371#discussion_r1359100137


##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -1933,6 +1933,39 @@ public RoaringBitmap getMapTracker() {
 int getNumIOExceptions() {
   return numIOExceptions;
 }
+
+private record ResourceCleaner(
+File dataFile,
+FileChannel dataChannel,
+MergeShuffleFile indexFile,
+MergeShuffleFile metaFile,
+AppAttemptShuffleMergeId appAttemptShuffleMergeId,
+int reduceId) implements Runnable {
+
+  @Override
+  public void run() {
+try {

Review Comment:
   My only question is that this introduces some duplicate code with the 
`AppShufflePartitionInfo#closeAllFilesAndDeleteIfNeeded` method. Is there 
really no way to eliminate this duplication?
   
   



##
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java:
##
@@ -1756,6 +1759,8 @@ public static class AppShufflePartitionInfo {
   this.dataFilePos = 0;
   this.mapTracker = new RoaringBitmap();
   this.chunkTracker = new RoaringBitmap();
+  CLEANER.register(this, new ResourceCleaner(dataFile, dataChannel, 
indexFile,
+  metaFile, appAttemptShuffleMergeId, reduceId));

Review Comment:
   Indentation: 2 spaces



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #52:
URL: https://github.com/apache/spark-docker/pull/52#discussion_r1359079892


##
versions.json:
##
@@ -40,18 +40,41 @@
 "3.4.1"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-python3-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-python3-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-r-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-r-ubuntu",
 "3.4.1-r"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-r-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-r-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-ubuntu",
-"3.4.1-scala"
+"3.4.1-scala2.12",
+"scala2.12",
+"3.4.1-scala",
+"scala"
+  ]
+},
+{
+  "path": "3.4.1/scala2.13-java11-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-ubuntu",
+"3.4.1-scala2.13",
+"scala2.13"

Review Comment:
   Yes, agree! As I mentioned above comment, 2.13 better to introduce in latest 
version (3.5.0 for now).
   
   https://github.com/apache/spark-docker/pull/52#pullrequestreview-1676547189



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #52:
URL: https://github.com/apache/spark-docker/pull/52#discussion_r1359079892


##
versions.json:
##
@@ -40,18 +40,41 @@
 "3.4.1"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-python3-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-python3-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-r-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-r-ubuntu",
 "3.4.1-r"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-r-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-r-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-ubuntu",
-"3.4.1-scala"
+"3.4.1-scala2.12",
+"scala2.12",
+"3.4.1-scala",
+"scala"
+  ]
+},
+{
+  "path": "3.4.1/scala2.13-java11-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-ubuntu",
+"3.4.1-scala2.13",
+"scala2.13"

Review Comment:
   Yes, agree! As I mentioned above comment, 2.13 better to introduce in latest 
versio (3.5.0 for now)
   
   https://github.com/apache/spark-docker/pull/52#pullrequestreview-1676547189



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45530][CORE] Use `java.lang.ref.Cleaner` instead of `finalize` for `NioBufferedFileInputStream` [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #4:
URL: https://github.com/apache/spark/pull/4#issuecomment-1762516078

   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed tests:
   [error]  org.apache.spark.sql.kafka010.KafkaSourceStressSuite
   [warn] In the last 2048 seconds, 6.951 (0.3%) were spent in GC. [Heap: 
3.23GB free of 4.00GB, max 4.00GB] Consider increasing the JVM heap using 
`-Xmx` or try a different collector, e.g. `-XX:+UseG1GC`, for better 
performance.
   [error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests 
unsuccessful
   [error] Total time: 2049 s (34:09), completed Oct 13, 2023, 4:05:45 PM
   ```
   This is a case that started failing after upgrading to Kafka 3.6, the author 
is fixing it, it's unrelated to this PR. Could you take another look at this pr 
If you have time, thanks 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Revert "[SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g" [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #43372:
URL: https://github.com/apache/spark/pull/43372#issuecomment-1762514978

   cc @HyukjinKwon @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] Revert "[SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g" [spark]

2023-10-13 Thread via GitHub



LuciferYang opened a new pull request, #43372:
URL: https://github.com/apache/spark/pull/43372

   This reverts commit 3e2470de7ea8b97dcdd8875ef25f044998fb7588.
   
   ### What changes were proposed in this pull request?
   This pr revert change of https://github.com/apache/spark/pull/43364. 
   
   ### Why are the changes needed?
   It seems to have no effect on fixing `Publish snapshot`, it still failed
   - https://github.com/apache/spark/actions/runs/6514229181/job/17696846279
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #43364:
URL: https://github.com/apache/spark/pull/43364#issuecomment-1762513910

   @dongjoon-hyun It seems to still not work. Do you have any ideas or 
suggestions for optimizing the compilation options?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45514][SQL][MLLIB] Replace `scala.runtime.Tuple3Zipped` to `scala.collection.LazyZip3` [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43363:
URL: https://github.com/apache/spark/pull/43363#issuecomment-1762510717

   ping @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45513][CORE][SQL][MLLIB][CONNECT] Replace `scala.runtime.Tuple2Zipped` to `scala.collection.LazyZip2` [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43351:
URL: https://github.com/apache/spark/pull/43351#issuecomment-1762508892

   @LuciferYang @srowen The GA failure is unrelated to this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #43371:
URL: https://github.com/apache/spark/pull/43371#issuecomment-1762504715

   cc @mridulm FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45534] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]

2023-10-13 Thread via GitHub



zhaomin1423 commented on PR #43371:
URL: https://github.com/apache/spark/pull/43371#issuecomment-1762485442

   Please help me review it. @LuciferYang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-455315] Use java.lang.ref.Cleaner instead of finalize for RemoteBlockPushResolver [spark]

2023-10-13 Thread via GitHub



zhaomin1423 opened a new pull request, #43371:
URL: https://github.com/apache/spark/pull/43371

   
   
   ### What changes were proposed in this pull request?
   
   use java.lang.ref.Cleaner instead of finalize() for RemoteBlockPushResolver
   
   ### Why are the changes needed?
   
   The finalize() method has been marked as deprecated since Java 9 and will be 
removed in the future, java.lang.ref.Cleaner is the more recommended solution.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Pass GitHub Actions
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]

2023-10-13 Thread via GitHub



holdenk commented on code in PR #52:
URL: https://github.com/apache/spark-docker/pull/52#discussion_r1359026750


##
versions.json:
##
@@ -40,18 +40,41 @@
 "3.4.1"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-python3-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-python3-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-r-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-r-ubuntu",
 "3.4.1-r"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-r-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-r-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-ubuntu",
-"3.4.1-scala"
+"3.4.1-scala2.12",
+"scala2.12",
+"3.4.1-scala",
+"scala"
+  ]
+},
+{
+  "path": "3.4.1/scala2.13-java11-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-ubuntu",
+"3.4.1-scala2.13",
+"scala2.13"

Review Comment:
   Would we want to do the scala 2.13 tag with 3.5 instead  @Yikun 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45539][SS] Add assert and log to indicate watermark definition is required for streaming aggregation queries in append mode [spark]

2023-10-13 Thread via GitHub



anishshri-db commented on PR #43370:
URL: https://github.com/apache/spark/pull/43370#issuecomment-1762462455

   cc - @HeartSaVioR - PTAL, thx


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44735][SQL] Add warning msg when inserting columns with the same name by row that don't match up [spark]

2023-10-13 Thread via GitHub



holdenk commented on PR #42763:
URL: https://github.com/apache/spark/pull/42763#issuecomment-1762461979

   Merged to the current branch so should be in 4 :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-45539] Add assert and log to indicate watermark definition is required for streaming aggregation queries in append mode [spark]

2023-10-13 Thread via GitHub



anishshri-db opened a new pull request, #43370:
URL: https://github.com/apache/spark/pull/43370

   ### What changes were proposed in this pull request?
   Add assert and log to indicate watermark definition is required for 
streaming aggregation queries in append mode
   
   
   ### Why are the changes needed?
   We have a check for ensuring that watermark attributes are specified in 
append mode based on the UnsupportedOperationChecker. However, in some cases we 
got report where user hit this stack trace:
   
   ```
   org.apache.spark.SparkException: Exception thrown in awaitResult: Job 
aborted due to stage failure: Task 3 in stage 32.0 failed 4 times, most recent 
failure: Lost task 3.3 in stage 32.0 (TID 606) (10.5.71.29 executor 0): 
java.util.NoSuchElementException: None.get
   at scala.None$.get(Option.scala:529)
   at scala.None$.get(Option.scala:527)
   at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:472)
   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:708)
   at 
org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:145)
   at 
org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs$(statefulOperators.scala:145)
   at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec.timeTakenMs(statefulOperators.scala:414)
   at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$5(statefulOperators.scala:470)
   at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps.$anonfun$mapPartitionsWithStateStore$1(package.scala:63)
   at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:127)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
   ```
   
   In this case, the reason for failure is not immediately clear. Hence adding 
an assert and log message to indicate why the query failed on the executor.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing unit tests
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44735][SQL] Add warning msg when inserting columns with the same name by row that don't match up [spark]

2023-10-13 Thread via GitHub



holdenk commented on code in PR #42763:
URL: https://github.com/apache/spark/pull/42763#discussion_r1359025079


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableOutputResolver.scala:
##
@@ -455,6 +457,19 @@ object TableOutputResolver {
 }
   }
 
+  def suitableForByNameCheck(
+  byName: Boolean,
+  expected: Seq[Attribute],
+  queryOutput: Seq[Attribute]): Unit = {
+if (!byName && expected.size == queryOutput.size &&
+  expected.forall(e => queryOutput.exists(p => conf.resolver(p.name, 
e.name))) &&
+  expected.zip(queryOutput).exists(e => !conf.resolver(e._1.name, 
e._2.name))) {
+  logWarning("The query columns and the table columns have same names but 
different " +
+"orders. You can use INSERT [INTO | OVERWRITE] BY NAME to reorder the 
query columns to " +
+"align with the table columns.")

Review Comment:
   Right but it's a warning message so if a user is doing it intentionally they 
can ignore the warning.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-43396][CORE] Add config to control max ratio of decommissioning executors [spark]

2023-10-13 Thread via GitHub



holdenk commented on PR #41076:
URL: https://github.com/apache/spark/pull/41076#issuecomment-1762460656

   Looks like it's failing from scala style checks, can you fix the style 
issues when you get a chance? Thanks :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45439][SQL][UI] Reduce memory usage of LiveStageMetrics.accumIdsToMetricType [spark]

2023-10-13 Thread via GitHub



JoshRosen commented on PR #43250:
URL: https://github.com/apache/spark/pull/43250#issuecomment-1762420388

   Hmm, it looks like the `OracleIntegrationSuite` tests are flaky but I don't 
think that's related to this PR's changes:
   
   ```
   [info] OracleIntegrationSuite:
   [info] org.apache.spark.sql.jdbc.OracleIntegrationSuite *** ABORTED *** (7 
minutes, 38 seconds)
   [info]   The code passed to eventually never returned normally. Attempted 
429 times over 7.00309507997 minutes. Last failure message: ORA-12514: 
Cannot connect to database. Service freepdb1 is not registered with the 
listener at host 10.1.0.126 port 45139. 
(CONNECTION_ID=CC2wkBm6SPGoMF7IghzCeQ==). (DockerJDBCIntegrationSuite.scala:166)
   [info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
   [info]   at 
org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:219)
   [info]   at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
   [info]   at 
org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
   [info]   at 
org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
   [info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.eventually(DockerJDBCIntegrationSuite.scala:95)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

2023-10-13 Thread via GitHub



agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1358990343


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -5272,6 +5281,9 @@ class SQLConf extends Serializable with Logging with 
SqlApiConf {
 
   def decorrelateInnerQueryEnabled: Boolean = 
getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED)
 
+  def decorrelateInnerQueryEnabledForExistsIn: Boolean =
+
!getConf(SQLConf.DECORRELATE_EXISTS_IN_SUBQUERY_LEGACY_INCORRECT_COUNT_HANDLING_ENABLED)

Review Comment:
   the caller checks it:
   
https://github.com/search?q=repo%3Aapache%2Fspark%20decorrelateInnerQueryEnabledForExistsIn=code
   
   (first check of the `decorrelate` function, explicit check in CheckAnalysis)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45523][Python] Return useful error message if UDTF returns None for non-nullable column [spark]

2023-10-13 Thread via GitHub



dtenedor commented on code in PR #43356:
URL: https://github.com/apache/spark/pull/43356#discussion_r1358990069


##
python/pyspark/worker.py:
##
@@ -841,6 +841,27 @@ def _remove_partition_by_exprs(self, arg: Any) -> Any:
 "the query again."
 )
 
+# Compute the set of UDTF result columns whose types are not nullable.
+# We will check that the UDTF does not return None values for these 
columns below.
+non_nullable_result_cols = set()
+for i, field in enumerate(return_type):

Review Comment:
   Thanks for your reviews! I updated this PR to also check recursively for 
`None` values within array, struct, and map values as well.
   We can certainly consider other types of UDFs as well later (including Scala 
UDFs); it seems possible to decouple that work from Python UDTFs here, so I'll 
leave this PR to focus on the latter for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44210][CONNECT][SQL][PYTHON] Strengthen type checking and better comply with Connect specifications for `levenshtein` function [spark]

2023-10-13 Thread via GitHub



github-actions[bot] commented on PR #41724:
URL: https://github.com/apache/spark/pull/41724#issuecomment-1762416219

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Move SparkThrowableSuite to spark-common-utils [spark]

2023-10-13 Thread via GitHub



github-actions[bot] commented on PR #41851:
URL: https://github.com/apache/spark/pull/41851#issuecomment-1762416185

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-43879][CONNECT] Decouple handle command and send response on server side [spark]

2023-10-13 Thread via GitHub



github-actions[bot] commented on PR #41527:
URL: https://github.com/apache/spark/pull/41527#issuecomment-1762416232

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] upgrade github action [spark]

2023-10-13 Thread via GitHub



github-actions[bot] closed pull request #41828: upgrade github action
URL: https://github.com/apache/spark/pull/41828


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [MINOR] Eliminate maven build warnings: Using platform locale (en actually) to format date/time, i.e. build is platform dependent! [spark]

2023-10-13 Thread via GitHub



github-actions[bot] commented on PR #41842:
URL: https://github.com/apache/spark/pull/41842#issuecomment-1762416203

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-34612][CORE] Make outputDeterministicLevel a public API [spark]

2023-10-13 Thread via GitHub



github-actions[bot] closed pull request #41853: [SPARK-34612][CORE] Make 
outputDeterministicLevel a public API
URL: https://github.com/apache/spark/pull/41853


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

2023-10-13 Thread via GitHub



andylam-db commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1358985912


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -5272,6 +5281,9 @@ class SQLConf extends Serializable with Logging with 
SqlApiConf {
 
   def decorrelateInnerQueryEnabled: Boolean = 
getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED)
 
+  def decorrelateInnerQueryEnabledForExistsIn: Boolean =
+
!getConf(SQLConf.DECORRELATE_EXISTS_IN_SUBQUERY_LEGACY_INCORRECT_COUNT_HANDLING_ENABLED)

Review Comment:
   Should we check whether `decorrelateInnerQueryEnabled` is true here, too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45503][SS] RocksDB to use LZ4 [spark]

2023-10-13 Thread via GitHub



siying commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1762300061

   @dongjoon-hyun here it is: 
https://github.com/facebook/rocksdb/wiki/Setup-Options-and-Basic-Tuning#compression
 . I'm updating the description.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45427][CORE] Add RPC SSL settings to SSLOptions and SparkTransportConf [spark]

2023-10-13 Thread via GitHub



hasnain-db commented on PR #43238:
URL: https://github.com/apache/spark/pull/43238#issuecomment-1762251118

   test failure looks unrelated, retrying for the third time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45524][PYTHON][SQL] Initial support for Python data source read API [spark]

2023-10-13 Thread via GitHub



ueshin commented on code in PR #43360:
URL: https://github.com/apache/spark/pull/43360#discussion_r1358824516


##
sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonDataSource.scala:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io.{DataInputStream, DataOutputStream}
+
+import scala.collection.mutable.ArrayBuffer
+
+import net.razorvine.pickle.Pickler
+
+import org.apache.spark.api.python.{PythonFunction, PythonWorkerUtils, 
SpecialLengths}
+import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
+import org.apache.spark.sql.catalyst.plans.logical.PythonDataSource
+import org.apache.spark.sql.catalyst.types.DataTypeUtils.toAttributes
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.types.StructType
+
+/**
+ * A user-defined Python data source. This is used by the Python API.
+ */
+case class UserDefinedPythonDataSource(
+dataSource: PythonFunction,
+schema: StructType) {
+  def apply(session: SparkSession): DataFrame = {
+val source = PythonDataSource(dataSource, schema, output = 
toAttributes(schema))
+Dataset.ofRows(session, source)
+  }
+}
+
+case class PythonDataSourceReadInfo(
+func: Array[Byte],
+partitions: Seq[Array[Byte]])
+
+class UserDefinedPythonDataSourceReadRunner(
+func: PythonFunction,
+schema: StructType) extends 
PythonPlannerRunner[PythonDataSourceReadInfo](func) {
+
+  override val workerModule = "pyspark.sql.worker.plan_data_source_read"
+
+  override protected def writeToPython(dataOut: DataOutputStream, pickler: 
Pickler): Unit = {
+// Send Python data source
+PythonWorkerUtils.writePythonFunction(func, dataOut)
+
+// Send schema
+PythonWorkerUtils.writeUTF(schema.json, dataOut)
+  }
+
+  override protected def receiveFromPython(dataIn: DataInputStream): 
PythonDataSourceReadInfo = {
+// Receive the picked reader or an exception raised in Python worker.
+val length = dataIn.readInt()
+if (length == SpecialLengths.PYTHON_EXCEPTION_THROWN) {
+  val msg = PythonWorkerUtils.readUTF(dataIn)
+  throw 
QueryCompilationErrors.tableValuedFunctionFailedToAnalyseInPythonError(msg)
+}
+
+// Receive the pickled 'read' function.
+val pickledFunction: Array[Byte] = {
+  val obj = new Array[Byte](length)
+  dataIn.readFully(obj)
+  obj
+}

Review Comment:
   ```suggestion
   val pickledFunction = PythonWorkerUtils.readBytes(length, dataIn)
   ```



##
python/pyspark/sql/readwriter.py:
##
@@ -303,6 +329,50 @@ def load(
 if schema is not None:
 self.schema(schema)
 self.options(**options)
+
+# Load a Python data source
+if isinstance(self._format, Callable):
+# TODO: support path in options.
+
+# Create an instance of the data source.
+data_source_cls = cast(Type[DataSource], self._format)
+data_source = data_source_cls(self._options)
+
+# Get schema of the data source
+schema = self._schema or data_source.schema()
+if isinstance(schema, str):
+schema = _parse_datatype_string(schema)
+# Check if the schema is a valid StructType.
+if not isinstance(schema, StructType):
+raise PySparkTypeError(
+error_class="NOT_STR_OR_STRUCT",
+message_parameters={
+"arg_name": "schema",
+"arg_type": type(schema).__name__,
+},
+)
+
+jschema = self._spark._jsparkSession.parseDataType(schema.json())
+sc = self._spark._sc
+pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(
+sc, data_source
+)
+assert sc._jvm is not None
+func = sc._jvm.SimplePythonFunction(
+bytearray(pickled_command),
+env,
+includes,
+sc.pythonExec,
+sc.pythonVer,
+

Re: [PR] [SPARK-45487] Fix SQLSTATEs and temp errors [spark]

2023-10-13 Thread via GitHub



gengliangwang closed pull request #43342: [SPARK-45487] Fix SQLSTATEs and temp 
errors
URL: https://github.com/apache/spark/pull/43342


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45487] Fix SQLSTATEs and temp errors [spark]

2023-10-13 Thread via GitHub



gengliangwang commented on PR #43342:
URL: https://github.com/apache/spark/pull/43342#issuecomment-1762196108

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45507][SQL] Correctness fix for correlated scalar subqueries with COUNT aggregates [spark]

2023-10-13 Thread via GitHub



andylam-db commented on code in PR #43341:
URL: https://github.com/apache/spark/pull/43341#discussion_r1357456302


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##
@@ -360,8 +360,40 @@ object PullupCorrelatedPredicates extends 
Rule[LogicalPlan] with PredicateHelper
 plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
   case ScalarSubquery(sub, children, exprId, conditions, hint, 
mayHaveCountBugOld)
 if children.nonEmpty =>
-val (newPlan, newCond) = decorrelate(sub, plan)
-val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) {
+
+def mayHaveCountBugAgg(a: Aggregate): Boolean = {
+  a.groupingExpressions.isEmpty && 
a.aggregateExpressions.exists(_.exists {
+case a: AggregateExpression => 
a.aggregateFunction.defaultResult.isDefined
+case _ => false
+  })
+}
+
+// The below logic controls handling count bug for scalar subqueries in
+// [[DecorrelateInnerQuery]], and if we don't handle it here, we 
handle it in
+// [[RewriteCorrelatedScalarSubquery#constructLeftJoins]]. Note that 
handling it in
+// [[DecorrelateInnerQuery]] is always correct, and turning it off to 
handle it in
+// constructLeftJoins is an optimization, so that additional, 
redundant left outer joins are
+// not introduced.
+val handleCountBugInDecorrelate = !conf.getConf(
+  SQLConf.LEGACY_SCALAR_SUBQUERY_COUNT_BUG_HANDLING) && !(sub match {
+  // Handle count bug only if there exists lower level Aggs with count 
bugs. It does not
+  // matter if the top level agg is count bug vulnerable or not, 
because:
+  // 1. If the top level agg is count bug vulnerable, it can be 
handled in
+  // constructLeftJoins, unless there are lower aggs that are count 
bug vulnerable.
+  // E.g. COUNT(COUNT + COUNT)
+  // 2. If the top level agg is not count bug vulnerable, it can be 
count bug vulnerable if
+  // there are lower aggs that are count bug vulnerable. E.g. 
SUM(COUNT)
+  case agg: Aggregate => !agg.child.exists {
+case lowerAgg: Aggregate => mayHaveCountBugAgg(lowerAgg)
+case _ => false
+  }
+  case _ => false
+})
+val (newPlan, newCond) = decorrelate(sub, plan, 
handleCountBugInDecorrelate)
+val mayHaveCountBug = if (handleCountBugInDecorrelate) {
+  // Count bug was already handled in the above decorrelate function 
call.
+  false
+} else if (mayHaveCountBugOld.isEmpty) {

Review Comment:
   If the plan has already been decorrelated and count bugs handled in 
`decorrelate()` above, then we can't set `mayHaveCountBug` parameter to be 
true. This is because if we set it as true based on the pre-rewrite plan (which 
is what's happening now), running `splitSubquery()` in `constructLeftJoins` 
will fail because it encounters left outer joins (invalid operators).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45524][PYTHON][SQL] Initial support for Python data source read API [spark]

2023-10-13 Thread via GitHub



dtenedor commented on code in PR #43360:
URL: https://github.com/apache/spark/pull/43360#discussion_r1358666425


##
python/pyspark/sql/datasource.py:
##
@@ -0,0 +1,168 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from abc import ABC, abstractmethod
+from typing import Any, Dict, Iterator, Tuple, Union, TYPE_CHECKING
+
+from pyspark.sql import Row
+from pyspark.sql.types import StructType
+
+if TYPE_CHECKING:
+from pyspark.sql._typing import OptionalPrimitiveType
+
+
+__all__ = ["DataSource", "DataSourceReader"]
+
+
+class DataSource:
+def __init__(self, options: Dict[str, "OptionalPrimitiveType"]):
+"""
+Initializes the data source with user-provided options.
+
+Parameters
+--
+options : dict
+A dictionary representing the options for this data source.
+
+Notes
+-
+This method should not contain any non-serializable objects.
+"""
+self.options = options
+
+@property
+def name(self) -> str:
+"""
+Returns a string represents the short name of this data source.
+"""
+return self.__class__.__name__
+
+def schema(self) -> Union[StructType, str]:
+"""
+Returns the schema of the data source.
+
+It can reference the `options` field to infer the data source's schema 
when
+users do not explicitly specify it. This method is invoked once when 
calling
+`spark.read.format(...).load()` to get the schema for a data source 
read operation.
+If this method is not implemented, and a user does not provide a 
schema when
+reading the data source, an exception will be thrown.
+
+Returns
+---
+schema : StructType or str
+The schema of this data source or a DDL string represents the 
schema
+
+Examples
+
+Returns a DDL string:
+
+>>> def schema(self):
+>>>return "a INT, b STRING"
+
+Returns a StructType:
+
+>>> def schema(self):
+>>>   return StructType().add("a", "int").add("b", "string")
+"""
+...
+
+def reader(self, schema: StructType) -> "DataSourceReader":
+"""
+Returns a DataSourceReader instance for reading data.
+
+This method is required for readable data sources. It will be called 
once during

Review Comment:
   If we want to create a user-defined Python materialization in the future 
that has no corresponding reader implementation, this API would preclude this. 
Should we instead change the invariant to "at least one of 'reader' and/or 
'writer' must be implemented"?



##
python/pyspark/sql/datasource.py:
##
@@ -0,0 +1,168 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+from abc import ABC, abstractmethod
+from typing import Any, Dict, Iterator, Tuple, Union, TYPE_CHECKING
+
+from pyspark.sql import Row
+from pyspark.sql.types import StructType
+
+if TYPE_CHECKING:
+from pyspark.sql._typing import OptionalPrimitiveType
+
+
+__all__ = ["DataSource", "DataSourceReader"]
+
+
+class DataSource:

Review Comment:
   ```suggestion
   class DataSource:
   """
   This represents a user-defined table implemented in Python.
   Subsequent Spark programs can then query from and/or write to the table.
   This class defines the table's schema and declares an interface to define
   logic to generate the rows for a scan

Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

2023-10-13 Thread via GitHub



tgravescs commented on PR #43366:
URL: https://github.com/apache/spark/pull/43366#issuecomment-1762123200

   thanks, overall issue makes sense, looking in more detail, might be monday 
before I get time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45537][CORE]Fix the issue where the last task may get stuck in a multi-profile [spark]

2023-10-13 Thread via GitHub



mridulm commented on PR #43366:
URL: https://github.com/apache/spark/pull/43366#issuecomment-1762042548

   +CC @tgravescs 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45429][CORE] Add helper classes for SSL RPC communication [spark]

2023-10-13 Thread via GitHub



mridulm commented on code in PR #43244:
URL: https://github.com/apache/spark/pull/43244#discussion_r1358700597


##
common/network-common/src/main/java/org/apache/spark/network/protocol/EncryptedMessageWithHeader.java:
##
@@ -0,0 +1,148 @@
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.network.protocol;
+
+import javax.annotation.Nullable;
+import java.io.EOFException;
+import java.io.InputStream;

Review Comment:
   nit: Did not notice this earlier - reorder imports, and make sure `javax` 
comes after `java`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]

2023-10-13 Thread via GitHub



andylam-db commented on PR #42725:
URL: https://github.com/apache/spark/pull/42725#issuecomment-1761961814

   @cloud-fan I think the build is failing because of unrelated failing tests 
-- timeouts and what not. Could you take a look and see if we can merge this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ [spark]

2023-10-13 Thread via GitHub



JoshRosen commented on PR #43344:
URL: https://github.com/apache/spark/pull/43344#issuecomment-1761931294

   > I backported to branch-3.5/3.4/3.5 manually updating from `Assertions.` to 
`Assert.`.
   
   Thanks @dongjoon-hyun!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45516][CONNECT] Include QueryContext in SparkThrowable proto message [spark]

2023-10-13 Thread via GitHub



heyihong commented on code in PR #43352:
URL: https://github.com/apache/spark/pull/43352#discussion_r1358592923


##
python/pyspark/sql/connect/proto/base_pb2.pyi:
##
@@ -2869,6 +2869,59 @@ class 
FetchErrorDetailsResponse(google.protobuf.message.Message):
 self, oneof_group: typing_extensions.Literal["_file_name", 
b"_file_name"]
 ) -> typing_extensions.Literal["file_name"] | None: ...
 
+class QueryContext(google.protobuf.message.Message):

Review Comment:
   Not yet if I understand correctly. @gatorsmile has some concern on whether 
exposing QueryContext in the PySpark Exception APIs makes sense for non-sql 
PySpark exceptions. There is some ongoing discussion for this.



##
python/pyspark/sql/connect/proto/base_pb2.pyi:
##
@@ -2869,6 +2869,59 @@ class 
FetchErrorDetailsResponse(google.protobuf.message.Message):
 self, oneof_group: typing_extensions.Literal["_file_name", 
b"_file_name"]
 ) -> typing_extensions.Literal["file_name"] | None: ...
 
+class QueryContext(google.protobuf.message.Message):

Review Comment:
   Not yet if I understand correctly. @gatorsmile has some concern about 
whether exposing QueryContext in the PySpark Exception APIs makes sense for 
non-sql PySpark exceptions. There is some ongoing discussion for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45517][CONNECT] Expand more exception constructors to support error framework parameters [spark]

2023-10-13 Thread via GitHub



heyihong commented on PR #43368:
URL: https://github.com/apache/spark/pull/43368#issuecomment-1761859899

   @hvanhovell @juliuszsompolski @MaxGekk @HyukjinKwon Please take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [SPARK-45517][CONNECT] Expand more exception constructors to support error framework parameters [spark]

2023-10-13 Thread via GitHub



heyihong opened a new pull request, #43368:
URL: https://github.com/apache/spark/pull/43368

   
   
   ### What changes were proposed in this pull request?
   
   
   - Expand more exception constructors to support error framework parameters
   
   ### Why are the changes needed?
   
   
   - Better integration with the error framework
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   ### How was this patch tested?
   
   
   `build/sbt "connect-client-jvm/testOnly *SparkConnectClientSuite"`
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45508][CORE] Add "--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED" so Platform can access Cleaner on Java 9+ [spark]

2023-10-13 Thread via GitHub



dongjoon-hyun commented on PR #43344:
URL: https://github.com/apache/spark/pull/43344#issuecomment-1761819209

   I backported to branch-3.5/3.4/3.5 manually updating from `Assertions.` to 
`Assert.`.
   ```
   $ build/sbt "unsafe/testOnly *.PlatformUtilSuite"
   ...
   [info] Test run org.apache.spark.unsafe.PlatformUtilSuite started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.freeingOnHeapMemoryBlockResetsBaseObjectAndOffset
 started
   [info] Test org.apache.spark.unsafe.PlatformUtilSuite.overlappingCopyMemory 
started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.memoryDebugFillEnabledInTest started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.offHeapMemoryAllocatorThrowsAssertionErrorOnDoubleFree
 started
   [info] Test org.apache.spark.unsafe.PlatformUtilSuite.heapMemoryReuse started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.onHeapMemoryAllocatorPoolingReUsesLongArrays
 started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.onHeapMemoryAllocatorThrowsAssertionErrorOnDoubleFree
 started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.freeingOffHeapMemoryBlockResetsOffset 
started
   [info] Test 
org.apache.spark.unsafe.PlatformUtilSuite.cleanerCreateMethodIsDefined started
   [info] Test run org.apache.spark.unsafe.PlatformUtilSuite finished: 0 
failed, 0 ignored, 9 total, 0.022s
   [info] Passed: Total 9, Failed 0, Errors 0, Passed 9
   [success] Total time: 7 s, completed Oct 13, 2023, 9:52:18 AM
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45426][CORE] Add support for a ReloadingX509TrustManager [spark]

2023-10-13 Thread via GitHub



mridulm closed pull request #43249: [SPARK-45426][CORE] Add support for a 
ReloadingX509TrustManager
URL: https://github.com/apache/spark/pull/43249


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]

2023-10-13 Thread via GitHub



dongjoon-hyun commented on PR #43364:
URL: https://github.com/apache/spark/pull/43364#issuecomment-1761811406

   Thank you, @LuciferYang and all.
   
   Since Java 17 JVM GC is different than the old ParallelGC, we can optimize 
further.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45426][CORE] Add support for a ReloadingX509TrustManager [spark]

2023-10-13 Thread via GitHub



mridulm commented on PR #43249:
URL: https://github.com/apache/spark/pull/43249#issuecomment-1761811698

   The test failure is unrelated.
   Merged to master.
   Thanks for working on this @hasnain-db !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

2023-10-13 Thread via GitHub



sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1358449841


##
examples/src/main/resources/people.xml:
##
@@ -0,0 +1,15 @@
+
+
+
+Michael
+29
+
+
+Andy
+30
+
+
+Justin
+19
+
+

Review Comment:
   ```suggestion
   
   
   ```



##
docs/sql-data-sources-xml.md:
##
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a 
file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to a xml file.
+When reading a XML file, the `rowTag` option need to be specified to indicate 
the XML element that maps to a `DataFrame row`. The option() function
+can be used to customize the behavior of reading or writing, such as 
controlling behavior of the XML attributes, XSD validation, compression, and so
+on.
+
+
+
+
+{% include_example xml_dataset 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+
+
+
+{% include_example xml_dataset 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+
+
+
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+* `DataFrameReader`
+* `DataFrameWriter`
+* `DataStreamReader`
+* `DataStreamWriter`
+* the built-in functions below
+* `from_xml`
+* `to_xml`
+* `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+
+  Property 
NameDefaultMeaningScope
+  
+rowTag
+ROW
+The row tag of your xml files to treat as a row. For example, in this 
xml:   ... the appropriate value would 
be book.
+read
+  
+
+  
+samplingRatio
+1.0
+Defines fraction of rows used for schema inferring. XML built-in 
functions ignore this option.
+read
+  
+
+  
+excludeAttribute
+false
+Whether to exclude attributes in elements.
+read
+  
+
+  
+mode
+PERMISSIVE
+Allows a mode for dealing with corrupt records during parsing.
+
+  PERMISSIVE: when it meets a corrupted record, puts the 
malformed string into a field configured by 
columnNameOfCorruptRecord, and sets malformed fields to 
null. To keep corrupt records, an user can set a string type field 
named columnNameOfCorruptRecord in an user-defined schema. If a 
schema does not have the field, it drops corrupt records during parsing. When 
inferring a schema, it implicitly adds a columnNameOfCorruptRecord 
field in an output schema.
+  DROPMALFORMED: ignores the whole corrupted records. 
This mode is unsupported in the JSON built-in functions.
+  FAILFAST: throws an exception when it meets corrupted 
records.
+
+
+read
+  
+
+  
+  inferSchema
+  true
+  If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.
+  read
+  
+
+  
+  columnNameOfCorruptRecord
+  spark.sql.columnNameOfCorruptRecord
+  Allows renaming the new field having a malformed string created by 
PERMISSIVE mode.
+  read
+  
+
+  
+attributePrefix
+_
+The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.
+read/write
+  
+
+  
+valueTag
+_VALUE
+The tag used for the value when there are attributes in the element 
having no child.
+read/write
+  
+
+  
+encoding
+UTF-8
+For reading, decodes the XML files by the given encoding type. For 
writing, specifies encoding (charset) of saved XML files. XML built-in 
functions ignore this option. 
+read/write
+  
+
+  
+ignoreSurroundingSpaces
+false
+Defines whether surrounding whitespaces from values being read should 
be skipped.
+read
+  
+
+  
+  rowValidationXSDPath
+  null
+  Path to an optional XSD file that is used to validate the XML

Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

2023-10-13 Thread via GitHub



asfgit closed pull request #43323: [SPARK-45495][core] Support stage level task 
resource profile for k8s cluster when dynamic allocation disabled
URL: https://github.com/apache/spark/pull/43323


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45495][core] Support stage level task resource profile for k8s cluster when dynamic allocation disabled [spark]

2023-10-13 Thread via GitHub



tgravescs commented on PR #43323:
URL: https://github.com/apache/spark/pull/43323#issuecomment-1761738892

   merged to master and branch-3.5.  Thanks @wbo4958 @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #56:
URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358357340


##
3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile:
##
@@ -20,7 +20,10 @@ USER root
 
 RUN set -ex; \
 apt-get update; \
-apt-get install -y python3 python3-pip; \
+apt install -y software-properties-common; \
+add-apt-repository ppa:deadsnakes/ppa; \
+apt install python3.10; \

Review Comment:
   Is there any special reason why we use the python 3.10? I prefer to use os 
default python3 version from matainence cost view, and also os default python 
version has more stable quality and security.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #56:
URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358357340


##
3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile:
##
@@ -20,7 +20,10 @@ USER root
 
 RUN set -ex; \
 apt-get update; \
-apt-get install -y python3 python3-pip; \
+apt install -y software-properties-common; \
+add-apt-repository ppa:deadsnakes/ppa; \
+apt install python3.10; \

Review Comment:
   Is there any special reason why we use the python 3.10? I prefer to use os 
default python3 version from matainence cost view, and also os default python 
version has more stable quality.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #43364:
URL: https://github.com/apache/spark/pull/43364#issuecomment-1761659955

   Merge into master to observe the `Publish Snapshot` job, if it doesn't work, 
we can revert it tomorrow.
   
   Thanks @HyukjinKwon @beliefer @EnricoMi 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]

2023-10-13 Thread via GitHub



LuciferYang closed pull request #43364: [SPARK-45536][BUILD] Lower the default 
`-Xmx` of `build/mvn` to 3g
URL: https://github.com/apache/spark/pull/43364


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45536][BUILD] Lower the default `-Xmx` of `build/mvn` to 3g [spark]

2023-10-13 Thread via GitHub



LuciferYang commented on PR #43364:
URL: https://github.com/apache/spark/pull/43364#issuecomment-1761654980

   ```
   [info] *** 1 TEST FAILED ***
   [error] Failed tests:
   [error]  org.apache.spark.sql.kafka010.KafkaSourceStressSuite
   [error] (sql-kafka-0-10 / Test / test) sbt.TestsFailedException: Tests 
unsuccessful
   [error] Total time: 2132 s (35:32), completed Oct 13, 2023, 12:07:56 PM
   ```
   
   Only `KafkaSourceStressSuite` test failed, this is a known flaky test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #52:
URL: https://github.com/apache/spark-docker/pull/52#discussion_r1358346064


##
3.4.1/scala2.12-java11-ubuntu/entrypoint.sh:
##
@@ -77,6 +77,9 @@ elif ! [ -z "${SPARK_HOME+x}" ]; then
   SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
 fi
 
+# SPARK-43540: add current working directory into executor classpath

Review Comment:
   cc @HyukjinKwon @zhengruifeng @dongjoon-hyun 
   
   Or we could just apply master dockerfile changes in 3.4 images?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-40513] Switch 3.4.0 default Java to Java17 [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on PR #35:
URL: https://github.com/apache/spark-docker/pull/35#issuecomment-1761638285

   Replaced by https://github.com/apache/spark-docker/pull/56


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-40513] Switch 3.4.0 default Java to Java17 [spark-docker]

2023-10-13 Thread via GitHub



Yikun closed pull request #35: [SPARK-40513] Switch 3.4.0 default Java to Java17
URL: https://github.com/apache/spark-docker/pull/35


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #56:
URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358361784


##
tools/template.py:
##
@@ -59,7 +59,7 @@ def parse_opts():
 parser.add_argument(
 "-j",
 "--java-version",
-help="The Spark version of Dockerfile.",
+help="Java version of Dockerfile.",

Review Comment:
   Good catch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #56:
URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358361340


##
.github/workflows/test.yml:
##
@@ -37,12 +37,15 @@ on:
 - 3.3.0
   java:
 description: 'The Java version of Spark image.'
-default: 11
+default: "11"

Review Comment:
   Is it neccessary?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP] Add support for java 17 and explicit Python versions from spark 3.5.0 onwards [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #56:
URL: https://github.com/apache/spark-docker/pull/56#discussion_r1358359041


##
add-dockerfiles.sh:
##
@@ -44,12 +48,20 @@ for TAG in $TAGS; do
 if echo $TAG | grep -q "r-"; then
 OPTS+=" --sparkr"
 fi
+
+if echo $TAG | grep -q "java17"; then
+OPTS+=" --java-version 17 --image eclipse-temurin:17-jre-jammy"
+fi
+if echo $TAG | grep -q "java11"; then

Review Comment:
   elif?



##
add-dockerfiles.sh:
##
@@ -44,12 +48,20 @@ for TAG in $TAGS; do
 if echo $TAG | grep -q "r-"; then
 OPTS+=" --sparkr"
 fi
+
+if echo $TAG | grep -q "java17"; then
+OPTS+=" --java-version 17 --image eclipse-temurin:17-jre-jammy"

Review Comment:
   Greate!



##
3.5.0/scala2.12-java11-python3-r-ubuntu/Dockerfile:
##
@@ -20,7 +20,10 @@ USER root
 
 RUN set -ex; \
 apt-get update; \
-apt-get install -y python3 python3-pip; \
+apt install -y software-properties-common; \
+add-apt-repository ppa:deadsnakes/ppa; \
+apt install python3.10; \

Review Comment:
   Is there any special reason why we use the python 3.10? I prefer to use os 
default python3 version from matainence cost view.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] Add Support for Scala 2.13 in Spark 3.4.1 [spark-docker]

2023-10-13 Thread via GitHub



Yikun commented on code in PR #52:
URL: https://github.com/apache/spark-docker/pull/52#discussion_r1358343497


##
3.4.1/scala2.12-java11-ubuntu/entrypoint.sh:
##
@@ -77,6 +77,9 @@ elif ! [ -z "${SPARK_HOME+x}" ]; then
   SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
 fi
 
+# SPARK-43540: add current working directory into executor classpath

Review Comment:
   This is 3.5.x series, we might create a `entrypoint.sh.3.4.template`, 
`Dockerfile.3.4.template`.
   
   We also 
[change](https://github.com/apache/spark-docker/blob/master/add-dockerfiles.sh#L53)
 the `add-dockerfiles.sh`, For 3.x.1 version, if `3.x` templates exists, use 
3.x template, otherwise use the master entrypoint and Dockerfile directly.
   
   It can be a separate PR.



##
versions.json:
##
@@ -40,18 +40,41 @@
 "3.4.1"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-python3-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-python3-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-r-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-r-ubuntu",
 "3.4.1-r"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-r-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-r-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-ubuntu",
-"3.4.1-scala"
+"3.4.1-scala2.12",
+"scala2.12",
+"3.4.1-scala",
+"scala"

Review Comment:
   scala should be removed, because it's 3.5.0 now



##
3.4.1/scala2.12-java11-ubuntu/entrypoint.sh:
##
@@ -77,6 +77,9 @@ elif ! [ -z "${SPARK_HOME+x}" ]; then
   SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
 fi
 
+# SPARK-43540: add current working directory into executor classpath

Review Comment:
   cc @HyukjinKwon @zhengruifeng @dongjoon-hyun 
   
   Or we could just apply master dockerfile changes in 3.4-branches?



##
.github/workflows/build_3.4.1.yaml:
##
@@ -24,18 +24,18 @@ on:
 branches:
   - 'master'
 paths:
-  - '3.4.1/**'
+  - '3.4.1/scala2.**'
 
 jobs:
   run-build:
 strategy:
   matrix:
 image-type: ["all", "python", "scala", "r"]
+scala: ["2.13","2.12"]
 name: Run
 secrets: inherit
 uses: ./.github/workflows/main.yml
 with:
   spark: 3.4.1
-  scala: 2.12

Review Comment:
   This shouldn't be removed, so CI is break (not triggered).
   
   Try
   
   ```bash
   scala: ${{ matrix.scala }}
   ```
   
   [1] https://github.com/apache/spark-docker/actions/runs/6398139681/workflow



##
3.4.1/scala2.12-java11-ubuntu/entrypoint.sh:
##
@@ -90,6 +93,7 @@ case "$1" in
 CMD=(
   "$SPARK_HOME/bin/spark-submit"
   --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+  --conf "spark.executorEnv.SPARK_DRIVER_POD_IP=$SPARK_DRIVER_BIND_ADDRESS"

Review Comment:
   ditto



##
versions.json:
##
@@ -40,18 +40,41 @@
 "3.4.1"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-python3-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-python3-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-r-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-r-ubuntu",
 "3.4.1-r"
   ]
 },
+{
+  "path": "3.4.1/scala2.13-java11-r-ubuntu",
+  "tags": [
+"3.4.1-scala2.13-java11-r-ubuntu"
+  ]
+},
 {
   "path": "3.4.1/scala2.12-java11-ubuntu",
   "tags": [
 "3.4.1-scala2.12-java11-ubuntu",
-"3.4.1-scala"
+"3.4.1-scala2.12",
+"scala2.12",

Review Comment:
   scala2.12 should be removed, because it's 3.5.0 now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[PR] [WIP] Add support for java 17 and explicit Python versions from 3.5.0 [spark-docker]

2023-10-13 Thread via GitHub



vakarisbk opened a new pull request, #56:
URL: https://github.com/apache/spark-docker/pull/56

   
   ### What changes were proposed in this pull request?
   1. Create Java17 base images alongside Java11 images starting from spark 
3.5.0
   2. Add the ability to explicitly define Python versions 
   3. Change ubuntu version to 22.04 for `scala2.12-java17-*`
   
   ### Why are the changes needed?
   
   Spark supports multiple versions of Java and Spark and some community 
members have a need to use specific versions of Java and Python for their use 
cases. Adding this option would simplify workflows for these users and make 
Spark more accessible.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45468][UI] More transparent proxy handling for HTTP redirects [spark]

2023-10-13 Thread via GitHub



nsuke commented on PR #43297:
URL: https://github.com/apache/spark/pull/43297#issuecomment-1761482605

   Hi @yaooqinn @dongjoon-hyun @srowen could you review this? This makes Web UI 
proxy setup simpler.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



cloud-fan commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358221878


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -114,22 +115,19 @@ object JdbcUtils extends Logging with SQLConfHelper {
   isCaseSensitive: Boolean,
   dialect: JdbcDialect): String = {

Review Comment:
   note: `getInsertStatement` is not a public API but we don't stop people from 
calling `JdbcUtils` as there is no `private[spark]`. I'm fine keeping it to 
avoid troubles for third-party Spark vendors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-45531][SQL][DOCS] Add more comments and rename some variable name for InjectRuntimeFilter [spark]

2023-10-13 Thread via GitHub



cloud-fan commented on PR #43359:
URL: https://github.com/apache/spark/pull/43359#issuecomment-1761466792

   what variables did you rename?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-44752][SQL] XML: Update Spark Docs [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43350:
URL: https://github.com/apache/spark/pull/43350#issuecomment-1761452245

   > Thank you very much for your help
   
   Please refer https://github.com/apache/spark/blob/master/docs/README.md


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



Hisoka-X commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358194353


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -104,34 +105,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
 JdbcDialects.get(url).isCascadingTruncateTable()
   }
 
-  /**
-   * Returns an Insert SQL statement for inserting a row into the target table 
via JDBC conn.
-   */
-  def getInsertStatement(

Review Comment:
   addressed all.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358189908


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -670,7 +643,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
* updated even with error if it doesn't support transaction, as there're 
dirty outputs.
*/
   def savePartition(
-  table: String,

Review Comment:
   Yes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



Hisoka-X commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358187351


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -670,7 +643,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
* updated even with error if it doesn't support transaction, as there're 
dirty outputs.
*/
   def savePartition(
-  table: String,

Review Comment:
   Should I revert this as well? @beliefer 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [WIP][SPARK-45531][SQL][DOCS] Add more comments and rename some variable name for InjectRuntimeFilter [spark]

2023-10-13 Thread via GitHub



beliefer commented on PR #43359:
URL: https://github.com/apache/spark/pull/43359#issuecomment-1761418972

   The failure is unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358154414


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -104,34 +105,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
 JdbcDialects.get(url).isCascadingTruncateTable()
   }
 
-  /**
-   * Returns an Insert SQL statement for inserting a row into the target table 
via JDBC conn.
-   */
-  def getInsertStatement(

Review Comment:
   This is a public API too. Shall we refactor `getInsertStatement` by use 
`dialect.insertIntoTable` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



Hisoka-X commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358145882


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
   def saveTable(
   df: DataFrame,
   tableSchema: Option[StructType],
-  isCaseSensitive: Boolean,

Review Comment:
   done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358121363


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
   def saveTable(
   df: DataFrame,
   tableSchema: Option[StructType],
-  isCaseSensitive: Boolean,

Review Comment:
   This is a public API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45486][CONNECT] Make add_artifact request idempotent [spark]

2023-10-13 Thread via GitHub



cdkrot commented on code in PR #43314:
URL: https://github.com/apache/spark/pull/43314#discussion_r1358099191


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala:
##
@@ -82,6 +83,40 @@ class SparkConnectArtifactManager(sessionHolder: 
SessionHolder) extends Logging
*/
   def getSparkConnectPythonIncludes: Seq[String] = 
pythonIncludeList.asScala.toSeq
 
+  private def areFilesEqual(path1: Path, path2: Path): Boolean = {

Review Comment:
   Added this, thanks for the feedback!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-45486][CONNECT] Make add_artifact request idempotent [spark]

2023-10-13 Thread via GitHub



cdkrot commented on code in PR #43314:
URL: https://github.com/apache/spark/pull/43314#discussion_r1358090333


##
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala:
##
@@ -82,6 +83,40 @@ class SparkConnectArtifactManager(sessionHolder: 
SessionHolder) extends Logging
*/
   def getSparkConnectPythonIncludes: Seq[String] = 
pythonIncludeList.asScala.toSeq
 
+  private def areFilesEqual(path1: Path, path2: Path): Boolean = {

Review Comment:
   That sounds good, let me see if we can use it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



Hisoka-X commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358087806


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
   def saveTable(
   df: DataFrame,
   tableSchema: Option[StructType],
-  isCaseSensitive: Boolean,

Review Comment:
   Is there any special reason?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Re: [PR] [SPARK-44262][SQL] Add `dropTable` and `getInsertStatement` to JdbcDialect [spark]

2023-10-13 Thread via GitHub



beliefer commented on code in PR #41855:
URL: https://github.com/apache/spark/pull/41855#discussion_r1358085152


##
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##
@@ -854,7 +854,6 @@ object JdbcUtils extends Logging with SQLConfHelper {
   def saveTable(
   df: DataFrame,
   tableSchema: Option[StructType],
-  isCaseSensitive: Boolean,

Review Comment:
   Please not change the signature. We still not use it is OK.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 >

1 - 100 of 139 matches

Mail list logo