Re: [PR] [SPARK-32246][BUILD][INFRA] Add new Github Action to run Kinesis tests [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43736:
URL: https://github.com/apache/spark/pull/43736#discussion_r1395285496


##
.github/workflows/build_and_test.yml:
##
@@ -555,6 +555,81 @@ jobs:
   with:
 name: test-results-sparkr--${{ inputs.java }}-${{ inputs.hadoop 
}}-hive2.3
 path: "**/target/test-reports/*.xml"
+
+  kinesis-asl:

Review Comment:
   BTW, do we need to add a new pipeline? If this is a small test, we can 
append this to the existing pipeline.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org






Re: [PR] [SPARK-32246][BUILD][INFRA] Add new Github Action to run Kinesis tests [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43736:
URL: https://github.com/apache/spark/pull/43736#discussion_r1395282977


##
pom.xml:
##
@@ -202,6 +202,7 @@
 4.1.17
 14.0.1
 3.1.9
+2.2.11

Review Comment:
   You can spin this off together with 
https://github.com/apache/spark/pull/43736/files#r1395282370






Re: [PR] [SPARK-32246][BUILD][INFRA] Add new Github Action to run Kinesis tests [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43736:
URL: https://github.com/apache/spark/pull/43736#discussion_r1395282370


##
connector/kinesis-asl/pom.xml:
##
@@ -76,6 +76,12 @@
   jackson-dataformat-cbor
   ${fasterxml.jackson.version}
 
+
+  javax.xml.bind
+  jaxb-api
+  ${jaxb-api.version}
+  test
+

Review Comment:
   If we need this for testing, it looks like an independent issue. Could you 
spin it off from this GitHub Action PR?






Re: [PR] [SPARK-32246][BUILD][INFRA] Add new Github Action to run Kinesis tests [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43736:
URL: https://github.com/apache/spark/pull/43736#discussion_r1395281539


##
.github/workflows/build_and_test.yml:
##
@@ -1049,7 +1124,7 @@ jobs:
   sudo install minikube-linux-amd64 /usr/local/bin/minikube
   rm minikube-linux-amd64
   # Github Action limit cpu:2, memory: 6947MB, limit to 2U6G for 
better resource statistic
-  minikube start --cpus 2 --memory 6144
+  minikube start --cpus 2 --memory 6144 --force

Review Comment:
   Why do you touch `k8s-integration-tests` in a `kinesis` PR?






Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on PR #43818:
URL: https://github.com/apache/spark/pull/43818#issuecomment-1813946114

   Thanks @zhengruifeng 





Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]

2023-11-15 Thread via GitHub


LuciferYang closed pull request #43818: [SPARK-45938][INFRA] Add `utils` to the 
dependencies of the `core/unsafe/network_common` module in `module.py`
URL: https://github.com/apache/spark/pull/43818





Re: [PR] [SPARK-32246][BUILD][INFRA] Add new Github Action to run Kinesis tests [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43736:
URL: https://github.com/apache/spark/pull/43736#discussion_r1395280300


##
.github/workflows/build_and_test.yml:
##
@@ -555,6 +555,81 @@ jobs:
   with:
 name: test-results-sparkr--${{ inputs.java }}-${{ inputs.hadoop 
}}-hive2.3
 path: "**/target/test-reports/*.xml"
+
+  kinesis-asl:
+needs: [precondition, infra-image]
+# always run if sparkr == 'true', even infra-image is skip (such as 
non-master job)
+#if: (!cancelled()) && 
fromJson(needs.precondition.outputs.required).sparkr == 'true'

Review Comment:
   ?






Re: [PR] [SPARK-45948][K8S] Make single-pod spark jobs respect `spark.app.id` [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43833:
URL: https://github.com/apache/spark/pull/43833#issuecomment-1813937357

   Could you review this when you have some time, please, @LuciferYang ?





Re: [PR] [WIP][INFRA] Test PyArrow 14 [spark]

2023-11-15 Thread via GitHub


zhengruifeng commented on PR #43829:
URL: https://github.com/apache/spark/pull/43829#issuecomment-1813930288

   ```
   pyarrow  14.0.1
   pydantic 2.5.1
   pydantic_core2.14.3
   PyGObject3.36.0
   ```





[PR] [SPARK-45948][K8S] Make single-pod spark jobs respect `spark.app.id` [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun opened a new pull request, #43833:
URL: https://github.com/apache/spark/pull/43833

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





Re: [PR] [SPARK-45946][SS] Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite [spark]

2023-11-15 Thread via GitHub


anishshri-db commented on PR #43832:
URL: https://github.com/apache/spark/pull/43832#issuecomment-1813879801

   cc - @HeartSaVioR - PTAL, thx





[PR] [SPARK-45946] Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite [spark]

2023-11-15 Thread via GitHub


anishshri-db opened a new pull request, #43832:
URL: https://github.com/apache/spark/pull/43832

   ### What changes were proposed in this pull request?
   Fix use of deprecated FileUtils write to pass default charset in RocksDBSuite
   
   
   ### Why are the changes needed?
   Without the change, we were getting this compilation warning
   ```
   [warn] 
/Users/anish.shrigondekar/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala:854:17:
 method write in class FileUtils is deprecated
   [warn] Applicable -Wconf / @nowarn filters for this warning: msg=, cat=deprecation, 
site=org.apache.spark.sql.execution.streaming.state.RocksDBSuite, 
origin=org.apache.commons.io.FileUtils.write
   [warn]   FileUtils.write(file2, s"v2\n$json2")
   [warn] ^
   [warn] 
/Users/anish.shrigondekar/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala:1272:17:
 method write in class FileUtils is deprecated
   [warn] Applicable -Wconf / @nowarn filters for this warning: msg=, cat=deprecation, 
site=org.apache.spark.sql.execution.streaming.state.RocksDBSuite.generateFiles.$anonfun,
 origin=org.apache.commons.io.FileUtils.write
   [warn]   FileUtils.write(file, "a" * length)
   [warn]
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Ran test suite
   
   ```
   22:47:45.700 WARN 
org.apache.spark.sql.execution.streaming.state.RocksDBSuite:
   
   = POSSIBLE THREAD LEAK IN SUITE 
o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: 
ForkJoinPool.commonPool-worker-6 (daemon=true), 
ForkJoinPool.commonPool-worker-4 (daemon=true), rpc-boss-3-1 (daemon=true), 
ForkJoinPool.commonPool-worker-5 (daemon=true), 
ForkJoinPool.commonPool-worker-3 (daemon=true), 
ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), 
ForkJoinPool.commonPool-worker-1 (daemon=true) =
   [info] Run completed in 1 minute, 55 seconds.
   [info] Total number of tests run: 77
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 77, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   [success] Total time: 172 s (02:52), completed Nov 15, 2023, 10:47:46 PM
   ```
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
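   
   For illustration, a minimal sketch of the kind of change described, not the 
PR's exact code: the deprecated two-argument `FileUtils.write` is replaced by 
the commons-io overload that takes an explicit `Charset`. The file path and 
length here are made-up placeholders.
   
   ```scala
   import java.io.File
   import java.nio.charset.Charset
   
   import org.apache.commons.io.FileUtils
   
   object CharsetFixSketch {
     def main(args: Array[String]): Unit = {
       val file = new File("/tmp/rocksdb-suite-demo.txt") // hypothetical path
       val length = 16
   
       // Before (deprecated): the charset is the JVM default, implicitly.
       // FileUtils.write(file, "a" * length)
   
       // After: the same write, with the default charset passed explicitly.
       FileUtils.write(file, "a" * length, Charset.defaultCharset())
     }
   }
   ```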
   





Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1395211436


##
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.HashMap
+
+import org.apache.spark.SparkException
+import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT
+import org.apache.spark.resource.ResourceProfile
+
+/**
+ * Class to hold information about a series of resources belonging to an 
executor.
+ * A resource could be a GPU, FPGA, etc. And it is used as a temporary
+ * class to calculate the resources amounts when offering resources to
+ * the tasks in the [[TaskSchedulerImpl]]
+ *
+ * One example is GPUs, where the addresses would be the indices of the GPUs
+ *
+ * @param resources The executor available resources and amount. eg,
+ *  Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT,

Review Comment:
   Done






Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1395204414


##
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.HashMap
+
+import org.apache.spark.SparkException
+import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT
+import org.apache.spark.resource.ResourceProfile
+
+/**
+ * Class to hold information about a series of resources belonging to an 
executor.
+ * A resource could be a GPU, FPGA, etc. And it is used as a temporary
+ * class to calculate the resources amounts when offering resources to
+ * the tasks in the [[TaskSchedulerImpl]]
+ *
+ * One example is GPUs, where the addresses would be the indices of the GPUs
+ *
+ * @param resources The executor available resources and amount. eg,
+ *  Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT,
+ *   "1" -> 1.0*RESOURCE_TOTAL_AMOUNT),
+ *  "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT,
+ *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT)
+ *  )
+ */
+private[spark] class ExecutorResourcesAmounts(
+private val resources: Map[String, Map[String, Long]]) extends 
Serializable {
+
+  /**
+   * convert the resources to be mutable HashMap
+   */
+  private val internalResources: Map[String, HashMap[String, Long]] = {
+resources.map { case (rName, addressAmounts) =>
+  rName -> HashMap(addressAmounts.toSeq: _*)
+}
+  }
+
+  /**
+   * The total address count of each resource. Eg,
+   * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *  "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *  "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT),
+   * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *   "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT))
+   * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2)
+   */
+  lazy val resourceAmount: Map[String, Int] = internalResources.map { case 
(rName, addressMap) =>

Review Comment:
   Done.
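   
   As a rough standalone sketch of the scaladoc example above (not the PR's 
code), the internal amounts are Longs scaled by `RESOURCE_TOTAL_AMOUNT`, and 
`resourceAmount` simply counts the addresses per resource; the 1e16 factor is 
assumed from the `ResourceAmountUtils` discussion later in this digest.
   
   ```scala
   // Assumed 1e16 scaling factor, see the ResourceAmountUtils thread below.
   val RESOURCE_TOTAL_AMOUNT: Long = 10000000000000000L
   
   val resources: Map[String, Map[String, Long]] = Map(
     "gpu"  -> Map("0" -> (0.5 * RESOURCE_TOTAL_AMOUNT).toLong,
                   "1" -> (0.5 * RESOURCE_TOTAL_AMOUNT).toLong,
                   "2" -> (0.5 * RESOURCE_TOTAL_AMOUNT).toLong),
     "fpga" -> Map("a" -> (0.5 * RESOURCE_TOTAL_AMOUNT).toLong,
                   "b" -> (0.5 * RESOURCE_TOTAL_AMOUNT).toLong))
   
   // Counts addresses per resource, ignoring how much of each is left:
   val resourceAmount: Map[String, Int] =
     resources.map { case (rName, addressMap) => rName -> addressMap.size }
   // => Map("gpu" -> 3, "fpga" -> 2)
   ```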






Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1395200464


##
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.HashMap
+
+import org.apache.spark.SparkException
+import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT
+import org.apache.spark.resource.ResourceProfile
+
+/**
+ * Class to hold information about a series of resources belonging to an 
executor.
+ * A resource could be a GPU, FPGA, etc. And it is used as a temporary
+ * class to calculate the resources amounts when offering resources to
+ * the tasks in the [[TaskSchedulerImpl]]
+ *
+ * One example is GPUs, where the addresses would be the indices of the GPUs
+ *
+ * @param resources The executor available resources and amount. eg,
+ *  Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT,
+ *   "1" -> 1.0*RESOURCE_TOTAL_AMOUNT),
+ *  "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT,
+ *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT)
+ *  )
+ */
+private[spark] class ExecutorResourcesAmounts(
+private val resources: Map[String, Map[String, Long]]) extends 
Serializable {
+
+  /**
+   * convert the resources to be mutable HashMap
+   */
+  private val internalResources: Map[String, HashMap[String, Long]] = {
+resources.map { case (rName, addressAmounts) =>
+  rName -> HashMap(addressAmounts.toSeq: _*)
+}
+  }
+
+  /**
+   * The total address count of each resource. Eg,
+   * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *  "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *  "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT),
+   * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *   "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT))
+   * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2)
+   */
+  lazy val resourceAmount: Map[String, Int] = internalResources.map { case 
(rName, addressMap) =>
+rName -> addressMap.size
+  }
+
+  /**
+   * For testing purpose. convert internal resources back to the "fraction" 
resources.
+   */
+  private[spark] def availableResources: Map[String, Map[String, Double]] = {
+internalResources.map { case (rName, addressMap) =>
+  rName -> addressMap.map { case (address, amount) =>
+address -> amount.toDouble / RESOURCE_TOTAL_AMOUNT
+  }.toMap
+}
+  }
+
+  /**
+   * Acquire the resource and update the resource
+   * @param assignedResource the assigned resource information
+   */
+  def acquire(assignedResource: Map[String, Map[String, Long]]): Unit = {
+assignedResource.foreach { case (rName, taskResAmounts) =>
+  val availableResourceAmounts = internalResources.getOrElse(rName,
+throw new SparkException(s"Try to acquire an address from $rName that 
doesn't exist"))
+  taskResAmounts.foreach { case (address, amount) =>
+val prevInternalTotalAmount = 
availableResourceAmounts.getOrElse(address,
+  throw new SparkException(s"Try to acquire an address that doesn't 
exist. $rName " +
+s"address $address doesn't exist."))
+
+val left = prevInternalTotalAmount - amount
+if (left < 0) {
+  throw new SparkException(s"The total amount ${left.toDouble / 
RESOURCE_TOTAL_AMOUNT} " +
+s"after acquiring $rName address $address should be >= 0")
+}
+internalResources(rName)(address) = left
+  }
+}
+  }
+
+  /**
+   * Release the assigned resources to the resource pool
+   * @param assignedResource resource to be released
+   */
+  def release(assignedResource: Map[String, Map[String, Long]]): Unit = {
+assignedResource.foreach { case (rName, taskResAmounts) =>
+  val availableResourceAmounts = internalResources.getOrElse(rName,
+throw new SparkException(s"Try to release an address from $rName that 
doesn't exist"))
+  taskResAmounts.foreach { case (address, amount) =>
+val prevInternalTotalAmount = 

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1395199672


##
core/src/main/scala/org/apache/spark/scheduler/ExecutorResourcesAmounts.scala:
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable.HashMap
+
+import org.apache.spark.SparkException
+import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT
+import org.apache.spark.resource.ResourceProfile
+
+/**
+ * Class to hold information about a series of resources belonging to an 
executor.
+ * A resource could be a GPU, FPGA, etc. And it is used as a temporary
+ * class to calculate the resources amounts when offering resources to
+ * the tasks in the [[TaskSchedulerImpl]]
+ *
+ * One example is GPUs, where the addresses would be the indices of the GPUs
+ *
+ * @param resources The executor available resources and amount. eg,
+ *  Map("gpu" -> Map("0" -> 0.2*RESOURCE_TOTAL_AMOUNT,
+ *   "1" -> 1.0*RESOURCE_TOTAL_AMOUNT),
+ *  "fpga" -> Map("a" -> 0.3*RESOURCE_TOTAL_AMOUNT,
+ *"b" -> 0.9*RESOURCE_TOTAL_AMOUNT)
+ *  )
+ */
+private[spark] class ExecutorResourcesAmounts(
+private val resources: Map[String, Map[String, Long]]) extends 
Serializable {
+
+  /**
+   * convert the resources to be mutable HashMap
+   */
+  private val internalResources: Map[String, HashMap[String, Long]] = {
+resources.map { case (rName, addressAmounts) =>
+  rName -> HashMap(addressAmounts.toSeq: _*)
+}
+  }
+
+  /**
+   * The total address count of each resource. Eg,
+   * Map("gpu" -> Map("0" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *  "1" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *  "2" -> 0.5 * RESOURCE_TOTAL_AMOUNT),
+   * "fpga" -> Map("a" -> 0.5 * RESOURCE_TOTAL_AMOUNT,
+   *   "b" -> 0.5 * RESOURCE_TOTAL_AMOUNT))
+   * the resourceAmount will be Map("gpu" -> 3, "fpga" -> 2)
+   */
+  lazy val resourceAmount: Map[String, Int] = internalResources.map { case 
(rName, addressMap) =>
+rName -> addressMap.size
+  }
+
+  /**
+   * For testing purpose. convert internal resources back to the "fraction" 
resources.
+   */
+  private[spark] def availableResources: Map[String, Map[String, Double]] = {
+internalResources.map { case (rName, addressMap) =>
+  rName -> addressMap.map { case (address, amount) =>
+address -> amount.toDouble / RESOURCE_TOTAL_AMOUNT
+  }.toMap
+}
+  }
+
+  /**
+   * Acquire the resource and update the resource
+   * @param assignedResource the assigned resource information
+   */
+  def acquire(assignedResource: Map[String, Map[String, Long]]): Unit = {
+assignedResource.foreach { case (rName, taskResAmounts) =>
+  val availableResourceAmounts = internalResources.getOrElse(rName,
+throw new SparkException(s"Try to acquire an address from $rName that 
doesn't exist"))
+  taskResAmounts.foreach { case (address, amount) =>
+val prevInternalTotalAmount = 
availableResourceAmounts.getOrElse(address,
+  throw new SparkException(s"Try to acquire an address that doesn't 
exist. $rName " +
+s"address $address doesn't exist."))
+
+val left = prevInternalTotalAmount - amount
+if (left < 0) {
+  throw new SparkException(s"The total amount ${left.toDouble / 
RESOURCE_TOTAL_AMOUNT} " +
+s"after acquiring $rName address $address should be >= 0")
+}
+internalResources(rName)(address) = left
+  }
+}
+  }
+
+  /**
+   * Release the assigned resources to the resource pool
+   * @param assignedResource resource to be released
+   */
+  def release(assignedResource: Map[String, Map[String, Long]]): Unit = {
+assignedResource.foreach { case (rName, taskResAmounts) =>
+  val availableResourceAmounts = internalResources.getOrElse(rName,
+throw new SparkException(s"Try to release an address from $rName that 
doesn't exist"))
+  taskResAmounts.foreach { case (address, amount) =>
+val prevInternalTotalAmount = 

Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1395199364


##
core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala:
##
@@ -20,6 +20,42 @@ package org.apache.spark.resource
 import scala.collection.mutable
 
 import org.apache.spark.SparkException
+import org.apache.spark.resource.ResourceAmountUtils.RESOURCE_TOTAL_AMOUNT
+
+private[spark] object ResourceAmountUtils {
+  /**
+   * Using "double" to do the resource calculation may encounter a problem of
+   * precision loss. Eg
+   *
+   * scala> val taskAmount = 1.0 / 9
+   * taskAmount: Double = 0.1111111111111111
+   *
+   * scala> var total = 1.0
+   * total: Double = 1.0
+   *
+   * scala> for (i <- 1 to 9 ) {
+   *      |   if (total >= taskAmount) {
+   *      |   total -= taskAmount
+   *      |   println(s"assign $taskAmount for task $i, total left: $total")
+   *      |   } else {
+   *      |   println(s"ERROR Can't assign $taskAmount for task $i, total left: $total")
+   *      |   }
+   *      | }
+   * assign 0.1111111111111111 for task 1, total left: 0.8888888888888889
+   * assign 0.1111111111111111 for task 2, total left: 0.7777777777777778
+   * assign 0.1111111111111111 for task 3, total left: 0.6666666666666665
+   * assign 0.1111111111111111 for task 4, total left: 0.5555555555555554
+   * assign 0.1111111111111111 for task 5, total left: 0.44444444444444425
+   * assign 0.1111111111111111 for task 6, total left: 0.33333333333333315
+   * assign 0.1111111111111111 for task 7, total left: 0.22222222222222204
+   * assign 0.1111111111111111 for task 8, total left: 0.11111111111111094
+   * ERROR Can't assign 0.1111111111111111 for task 9, total left: 0.11111111111111094
+   *
+   * So we multiply RESOURCE_TOTAL_AMOUNT to convert the double to long to
+   * avoid this limitation. Double can display up to 16 decimal places, so we
+   * set the factor to 10,000,000,000,000,000L.
+   */
+  final val RESOURCE_TOTAL_AMOUNT: Long = 10000000000000000L

Review Comment:
   Really good suggestion. Done
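   
   For reference, a minimal standalone sketch of the fixed-point idea discussed 
above (not the PR's exact code): the fraction is scaled into Long units once, 
so repeated subtraction stays exact. The `RESOURCE_TOTAL_AMOUNT` value is the 
1e16 factor quoted in the scaladoc.
   
   ```scala
   object FixedPointSketch {
     // Assumed scaling factor, taken from the scaladoc above (1e16).
     final val RESOURCE_TOTAL_AMOUNT: Long = 10000000000000000L
   
     def main(args: Array[String]): Unit = {
       // Encode the per-task fraction 1/9 as Long units once, up front.
       val taskAmount: Long = (RESOURCE_TOTAL_AMOUNT * (1.0 / 9)).toLong
       var total: Long = RESOURCE_TOTAL_AMOUNT
       for (i <- 1 to 9) {
         // With doubles the 9th assignment fails; with Longs all 9 succeed.
         assert(total >= taskAmount, s"cannot assign task $i")
         total -= taskAmount
         println(s"assign $taskAmount for task $i, total left: $total")
       }
       // The residue is 1 unit out of 1e16, never a negative balance.
     }
   }
   ```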






Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


wbo4958 commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1395199162


##
core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala:
##
@@ -191,7 +191,10 @@ private[spark] class CoarseGrainedExecutorBackend(
   } else {
 val taskDesc = TaskDescription.decode(data.value)
 logInfo("Got assigned task " + taskDesc.taskId)
-taskResources.put(taskDesc.taskId, taskDesc.resources)
+// Convert resources amounts into ResourceInformation
+val resources = taskDesc.resources.map { case (rName, 
addressesAmounts) =>
+  rName -> new ResourceInformation(rName, 
addressesAmounts.keys.toSeq.sorted.toArray)}
+taskResources.put(taskDesc.taskId, resources)

Review Comment:
   Sounds good. new commits have removed the taskResources






Re: [PR] [SPARK-45927][PYTHON] Update path handling for Python data source [spark]

2023-11-15 Thread via GitHub


allisonwang-db commented on code in PR #43809:
URL: https://github.com/apache/spark/pull/43809#discussion_r1395186286


##
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 val builder = 
sparkSession.sharedState.dataSourceManager.lookupDataSource(source)
 // Unless the legacy path option behavior is enabled, the extraOptions here
 // should not include "path" or "paths" as keys.
-val plan = builder(sparkSession, source, paths, userSpecifiedSchema, 
extraOptions)
+// Add path to the options field. Note currently it only supports a single 
path.
+val optionsWithPath = if (paths.isEmpty) {
+  extraOptions
+} else if (paths.length == 1) {
+extraOptions + ("path" -> paths.head)
+} else {
+  throw QueryCompilationErrors.multiplePathsUnsupportedError(source, paths)

Review Comment:
   Yea, let's just follow the DSv2 approach (options['paths'] = JSON-serialized 
string list) so the Python data source behaves the same as DSv2. I will update 
this.
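   
   For context, a hedged sketch of what that DSv2-style handling could look 
like; the function name is illustrative, and the `paths` option holds a 
JSON-serialized string array via Jackson, mirroring the convention described 
above.
   
   ```scala
   import com.fasterxml.jackson.databind.ObjectMapper
   
   def withPathOptions(extraOptions: Map[String, String],
                       paths: Seq[String]): Map[String, String] = {
     if (paths.isEmpty) {
       extraOptions
     } else if (paths.length == 1) {
       extraOptions + ("path" -> paths.head)
     } else {
       // Multiple paths: serialize them all into a single "paths" option,
       // following the DSv2 convention instead of throwing.
       val objectMapper = new ObjectMapper()
       extraOptions + ("paths" -> objectMapper.writeValueAsString(paths.toArray))
     }
   }
   ```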






Re: [PR] [SPARK-45511] Fix state reader suite flakiness by clean up resources after each test run [spark]

2023-11-15 Thread via GitHub


chaoqin-li1123 commented on PR #43831:
URL: https://github.com/apache/spark/pull/43831#issuecomment-1813830603

   @HeartSaVioR 





[PR] [SPARK-45511] fix state reader suite flakiness by clean up resources after each test run [spark]

2023-11-15 Thread via GitHub


chaoqin-li1123 opened a new pull request, #43831:
URL: https://github.com/apache/spark/pull/43831

   
   
   ### What changes were proposed in this pull request?
   Fix state reader suite flakiness by cleaning up resources after each test. 
Because all state store instances share the same maintenance task pool, a 
failed maintenance task from a previous test run may affect later runs and 
cause test failures. Clean up the StateStore explicitly to unflake the test.
   
   
   ### Why are the changes needed?
   To unflake the test.
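   
   A minimal sketch of the described cleanup, assuming a ScalaTest suite (the 
suite name here is hypothetical); `StateStore.stop()` unloads the loaded 
providers and shuts down the shared maintenance task.
   
   ```scala
   import org.apache.spark.sql.execution.streaming.state.StateStore
   import org.scalatest.BeforeAndAfterEach
   import org.scalatest.funsuite.AnyFunSuite
   
   class StateReaderSuiteSketch extends AnyFunSuite with BeforeAndAfterEach {
     override def afterEach(): Unit = {
       try {
         // Drop all loaded providers so a failed maintenance task from this
         // test cannot leak into the next one via the shared task pool.
         StateStore.stop()
       } finally {
         super.afterEach()
       }
     }
   }
   ```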





Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]

2023-11-15 Thread via GitHub


panbingkun commented on PR #37588:
URL: https://github.com/apache/spark/pull/37588#issuecomment-1813824544

   > thanks, merging to master!
   
   Thank you again for your great help! ❤️❤️❤️





Re: [PR] [SPARK-45927][PYTHON] Update path handling for Python data source [spark]

2023-11-15 Thread via GitHub


cloud-fan commented on code in PR #43809:
URL: https://github.com/apache/spark/pull/43809#discussion_r1395160860


##
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 val builder = 
sparkSession.sharedState.dataSourceManager.lookupDataSource(source)
 // Unless the legacy path option behavior is enabled, the extraOptions here
 // should not include "path" or "paths" as keys.
-val plan = builder(sparkSession, source, paths, userSpecifiedSchema, 
extraOptions)
+// Add path to the options field. Note currently it only supports a single 
path.
+val optionsWithPath = if (paths.isEmpty) {
+  extraOptions
+} else if (paths.length == 1) {
+extraOptions + ("path" -> paths.head)
+} else {
+  throw QueryCompilationErrors.multiplePathsUnsupportedError(source, paths)

Review Comment:
   does it help to add a `paths` option using JSON to hold String[]?






Re: [PR] [SPARK-45927][PYTHON] Update path handling for Python data source [spark]

2023-11-15 Thread via GitHub


cloud-fan commented on code in PR #43809:
URL: https://github.com/apache/spark/pull/43809#discussion_r1395160402


##
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -246,7 +246,15 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
 val builder = 
sparkSession.sharedState.dataSourceManager.lookupDataSource(source)
 // Unless the legacy path option behavior is enabled, the extraOptions here
 // should not include "path" or "paths" as keys.
-val plan = builder(sparkSession, source, paths, userSpecifiedSchema, 
extraOptions)
+// Add path to the options field. Note currently it only supports a single 
path.
+val optionsWithPath = if (paths.isEmpty) {
+  extraOptions
+} else if (paths.length == 1) {
+extraOptions + ("path" -> paths.head)

Review Comment:
   ```suggestion
 extraOptions + ("path" -> paths.head)
   ```






Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]

2023-11-15 Thread via GitHub


cloud-fan closed pull request #37588: [SPARK-33393][SQL] Support SHOW TABLE 
EXTENDED in v2
URL: https://github.com/apache/spark/pull/37588





Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]

2023-11-15 Thread via GitHub


cloud-fan commented on PR #37588:
URL: https://github.com/apache/spark/pull/37588#issuecomment-1813802035

   thanks, merging to master!





[PR] [SPARK-45764][PYTHON][DOCS][3.3] Make code block copyable [spark]

2023-11-15 Thread via GitHub


panbingkun opened a new pull request, #43830:
URL: https://github.com/apache/spark/pull/43830

   ### What changes were proposed in this pull request?
   The PR aims to make code blocks `copyable` in the PySpark docs.
   It backports the change to `branch-3.3`.
   Master branch PR: https://github.com/apache/spark/pull/43799
   
   
   ### Why are the changes needed?
   Improving the usability of the PySpark documents.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users will be able to easily copy code blocks in the PySpark docs.
   
   
   ### How was this patch tested?
   - Manually test.
   - Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





[PR] [WIP][INFRA] Test PyArrow 14 [spark]

2023-11-15 Thread via GitHub


zhengruifeng opened a new pull request, #43829:
URL: https://github.com/apache/spark/pull/43829

   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   





[PR] [SPARK-45764][PYTHON][DOCS][3.4] Make code block copyable [spark]

2023-11-15 Thread via GitHub


panbingkun opened a new pull request, #43828:
URL: https://github.com/apache/spark/pull/43828

   ### What changes were proposed in this pull request?
   The PR aims to make code blocks `copyable` in the PySpark docs.
   It backports the change to `branch-3.4`.
   Master branch PR: https://github.com/apache/spark/pull/43799
   
   
   ### Why are the changes needed?
   Improving the usability of the PySpark documents.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users will be able to easily copy code blocks in the PySpark docs.
   
   
   ### How was this patch tested?
   - Manually test.
   - Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





Re: [PR] [SPARK-45747][SS] Use prefix key information in state metadata to handle reading state for session window aggregation [spark]

2023-11-15 Thread via GitHub


HeartSaVioR closed pull request #43788: [SPARK-45747][SS] Use prefix key 
information in state metadata to handle reading state for session window 
aggregation
URL: https://github.com/apache/spark/pull/43788





Re: [PR] [SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable [spark]

2023-11-15 Thread via GitHub


panbingkun commented on PR #43827:
URL: https://github.com/apache/spark/pull/43827#issuecomment-1813732360

   I am making backports for other branches: branch-3.3, branch-3.4.





Re: [PR] [SPARK-45827][SQL] Fix variant parquet reader. [spark]

2023-11-15 Thread via GitHub


cloud-fan closed pull request #43825: [SPARK-45827][SQL] Fix variant parquet 
reader.
URL: https://github.com/apache/spark/pull/43825





Re: [PR] [SPARK-45827][SQL] Fix variant parquet reader. [spark]

2023-11-15 Thread via GitHub


cloud-fan commented on PR #43825:
URL: https://github.com/apache/spark/pull/43825#issuecomment-1813730404

   thanks, merging to master!





[PR] [SPARK-45764][PYTHON][DOCS][3.5] Make code block copyable [spark]

2023-11-15 Thread via GitHub


panbingkun opened a new pull request, #43827:
URL: https://github.com/apache/spark/pull/43827

   ### What changes were proposed in this pull request?
   The PR aims to make code blocks `copyable` in the PySpark docs.
   It backports the change to `branch-3.5`.
   
   
   ### Why are the changes needed?
   Improving the usability of the PySpark documents.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, users will be able to easily copy code blocks in the PySpark docs.
   
   
   ### How was this patch tested?
   - Manually test.
   - Pass GA.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   





[PR] [SPARK-45945][CONNECT] Add a helper function for `parser` [spark]

2023-11-15 Thread via GitHub


zhengruifeng opened a new pull request, #43826:
URL: https://github.com/apache/spark/pull/43826

   ### What changes were proposed in this pull request?
   Add a helper function for `parser`
   
   
   ### Why are the changes needed?
   we don't use any other parser in the planner; this helper is added just for 
simplification and consistency
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   





Re: [PR] [SPARK-45506][CONNECT] Add ivy URI support to SparkConnect addArtifact [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on PR #43354:
URL: https://github.com/apache/spark/pull/43354#issuecomment-1813722799

   @vsevolodstep-db 
   
   I found that after moving MavenUtilsSuite.scala to the common-utils module, 
the suite no longer passes. Do you know why? The current GA does not run this 
case (that issue will be fixed later); it can be reproduced locally with 
`build/sbt "common-utils/test"`.
   
   then
   
   ```
   [info] MavenUtilsSuite:
   [info] - incorrect maven coordinate throws error (8 milliseconds)
   [info] - create repo resolvers (24 milliseconds)
   [info] - create additional resolvers (3 milliseconds)
   :: loading settings :: url = 
jar:file:/Users/yangjie01/Library/Caches/Coursier/v1/https/repo1.maven.org/maven2/org/apache/ivy/ivy/2.5.1/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
   [info] - add dependencies works correctly (35 milliseconds)
   [info] - excludes works correctly (2 milliseconds)
   [info] - ivy path works correctly (3 seconds, 759 milliseconds)
   [info] - search for artifact at local repositories *** FAILED *** (2 
seconds, 833 milliseconds)
   [info]   java.lang.RuntimeException: [unresolved dependency: 
my.great.lib#mylib;0.1: java.text.ParseException: [[Fatal Error] 
ivy-0.1.xml.original:22:18: XML document structures must start and end within the same entity. in 
f/SourceCode/git/spark-mine-sbt/target/tmp/ivy-8b860aca-a9c4-4af9-b15a-ac8c6049b773/cache/my.great.lib/mylib/ivy-0.1.xml.original
   [info] ]]
   [info]   at 
org.apache.spark.util.MavenUtils$.resolveMavenCoordinates(MavenUtils.scala:459)
   [info]   at 
org.apache.spark.util.MavenUtilsSuite.$anonfun$new$25(MavenUtilsSuite.scala:173)
   [info]   at 
org.apache.spark.util.MavenUtilsSuite.$anonfun$new$25$adapted(MavenUtilsSuite.scala:172)
   [info]   at 
org.apache.spark.util.IvyTestUtils$.withRepository(IvyTestUtils.scala:373)
   [info]   at 
org.apache.spark.util.MavenUtilsSuite.$anonfun$new$18(MavenUtilsSuite.scala:172)
   [info]   at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
   [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
   [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
   [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
   [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
   [info]   at org.scalatest.TestSuite.withFixture(TestSuite.scala:196)
   [info]   at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
   [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
   [info]   at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
   [info]   at scala.collection.immutable.List.foreach(List.scala:333)
   [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   [info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
   [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
   [info]   at org.scalatest.Suite.run(Suite.scala:1114)
   [info]   at org.scalatest.Suite.run$(Suite.scala:1096)
   [info]   at 
org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
   [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
   [info]   at 
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
   [info]   at 
org.apache.spark.util.MavenUtilsSuite.org$scalatest$BeforeAndAfterAll$$super$run(MavenUtilsSuite.scala:36)
   [info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
   [info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
   [info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
   [info]   at 

Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on code in PR #43818:
URL: https://github.com/apache/spark/pull/43818#discussion_r1395096661


##
dev/sparktestsupport/modules.py:
##
@@ -178,7 +178,7 @@ def __hash__(self):
 
 core = Module(
 name="core",
-dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher],
+dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, 
utils],

Review Comment:
   done






Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core/unsafe/network_common` module in `module.py` [spark]

2023-11-15 Thread via GitHub


zhengruifeng commented on code in PR #43818:
URL: https://github.com/apache/spark/pull/43818#discussion_r1395096255


##
dev/sparktestsupport/modules.py:
##
@@ -113,6 +113,14 @@ def __hash__(self):
 ],
 )
 
+utils = Module(

Review Comment:
   yeah, this is python :)






Re: [PR] [SPARK-45764][PYTHON][DOCS] Make code block copyable [spark]

2023-11-15 Thread via GitHub


panbingkun commented on PR #43799:
URL: https://github.com/apache/spark/pull/43799#issuecomment-1813716352

   > @panbingkun would you mind creating a backporting PR? Actually yeah I 
think it's an important improvement in docs.
   
   Okay, let me do it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core` module in `module.py` [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on code in PR #43818:
URL: https://github.com/apache/spark/pull/43818#discussion_r1395095578


##
dev/sparktestsupport/modules.py:
##
@@ -113,6 +113,14 @@ def __hash__(self):
 ],
 )
 
+utils = Module(

Review Comment:
   Moving it is because of 
   
   (screenshot: https://github.com/apache/spark/assets/1475305/cc40aa65-2c04-4ca6-9a34-3b3da30954c1)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45938][INFRA] Add `utils` to the dependencies of the `core` module in `module.py` [spark]

2023-11-15 Thread via GitHub


zhengruifeng commented on code in PR #43818:
URL: https://github.com/apache/spark/pull/43818#discussion_r1395092781


##
dev/sparktestsupport/modules.py:
##
@@ -178,7 +178,7 @@ def __hash__(self):
 
 core = Module(
 name="core",
-dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher],
+dependencies=[kvstore, network_common, network_shuffle, unsafe, launcher, utils],

Review Comment:
   > utils module is also a direct dependency of unsafe and network-common
   
   let's also add this dependency



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45919][CORE][SQL] Use Java 16 `record` to simplify Java class definition [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on PR #43796:
URL: https://github.com/apache/spark/pull/43796#issuecomment-1813710853

   rebased


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [MINOR] Fix some typo [spark]

2023-11-15 Thread via GitHub


HyukjinKwon closed pull request #43724: [MINOR] Fix some typo
URL: https://github.com/apache/spark/pull/43724


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45922][CONNECT][CLIENT] Minor retries refactoring (follow-up to multiple policies) [spark]

2023-11-15 Thread via GitHub


HyukjinKwon commented on PR #43800:
URL: https://github.com/apache/spark/pull/43800#issuecomment-1813710341

   Mind retriggering 
https://github.com/cdkrot/apache_spark/actions/runs/6877183050/job/18704368968? 
I think it might be related.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [MINOR] Fix some typo [spark]

2023-11-15 Thread via GitHub


HyukjinKwon commented on PR #43724:
URL: https://github.com/apache/spark/pull/43724#issuecomment-1813710430

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45562][DOCS] Regenerate `docs/sql-error-conditions.md` and add `42KDF` to `SQLSTATE table` in `error/README.md` [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on PR #43817:
URL: https://github.com/apache/spark/pull/43817#issuecomment-1813709899

   Thanks @dongjoon-hyun @HyukjinKwon @beliefer @sandip-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow [spark]

2023-11-15 Thread via GitHub


HyukjinKwon closed pull request #43810: [SPARK-45930][SQL] Support 
non-deterministic UDFs in MapInPandas/MapInArrow
URL: https://github.com/apache/spark/pull/43810


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow [spark]

2023-11-15 Thread via GitHub


HyukjinKwon commented on PR #43810:
URL: https://github.com/apache/spark/pull/43810#issuecomment-1813707722

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-44488][SQL] Support deserializing long types when creating `Metadata` object from JObject [spark]

2023-11-15 Thread via GitHub


HyukjinKwon commented on PR #42083:
URL: https://github.com/apache/spark/pull/42083#issuecomment-1813704504

   It will be available from 4.0.0 most likely.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45533][CORE] Use j.l.r.Cleaner instead of finalize for RocksDBIterator/LevelDBIterator [spark]

2023-11-15 Thread via GitHub


LuciferYang commented on code in PR #43502:
URL: https://github.com/apache/spark/pull/43502#discussion_r1395081107


##
common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java:
##
@@ -182,23 +193,34 @@ public boolean skip(long n) {
 
   @Override
   public synchronized void close() throws IOException {
-db.notifyIteratorClosed(this);
+db.notifyIteratorClosed(it);
 if (!closed) {
-  it.close();
-  closed = true;
-  next = null;
+  try {
+it.close();
+  } finally {
+closed = true;
+next = null;
+cancelResourceClean();

Review Comment:
   Yes, we have discussed this issue. The reason for not directly calling 
`this.cleaner.clean()` is that the close process in the `Cleaner` adds a 
`synchronized (this._db)` operation, which is slightly different from the 
semantics of the original `close()` method. For the specific discussion, please 
refer to this thread: 
   
   https://github.com/apache/spark/pull/43502#discussion_r1372954706
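   
   For readers following the thread, here is a minimal, self-contained sketch of 
the pattern under discussion (illustrative names, not the actual 
`LevelDBIterator` fields): the cleanup action registered with 
`java.lang.ref.Cleaner` must synchronize on the shared DB handle, which is why 
triggering it from `close()` would change the locking behavior of the original 
`close()`.
   
   ```scala
   import java.lang.ref.Cleaner
   
   object IteratorCleanupSketch {
     private val cleaner = Cleaner.create()
   
     // The cleanup action must not reference the iterator wrapper itself,
     // only the underlying handle, so the wrapper stays collectible.
     private final class CleanupTask(db: AnyRef, it: AutoCloseable) extends Runnable {
       override def run(): Unit = db.synchronized {
         it.close() // runs under the DB lock when triggered via the Cleaner
       }
     }
   
     def register(owner: AnyRef, db: AnyRef, it: AutoCloseable): Cleaner.Cleanable =
       cleaner.register(owner, new CleanupTask(db, it))
   }
   ```
   
   An explicit `close()` would instead close the iterator directly, without 
taking the DB lock, and cancel the registered cleanable, matching the 
`cancelResourceClean()` call in the diff above.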



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-15 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1813660864

   > What do you mean by this, are you saying the Spark on YARN handling of 
preempted containers is not working properly? Meaning if the container is 
preempted it should not show up as an executor failure. Are you seeing those 
preempted containers show up as failed?
   Or are you saying that yes Spark on YARN doesn't mark preempted as failed?
   
   PREEMPTED is OK, and its cases are not counted by the executor failure 
tracker. I was wrong about this, sorry to bother you.
   
   > If that is the case then Spark should allow users to turn 
spark.executor.maxNumFailures off or I assume you could do the same thing by 
setting it to int.maxvalue.
   
   There are pros and cons to this suggestion, I guess. Disabling the executor 
failure tracker certainly keeps the app alive, but at the same time it 
invalidates fast fail.
   
   > As implemented this seems very arbitrary and I would think hard for a 
normal user to set and use this feature.
   
   Aren't most configurations with numeric values, and their defaults, in Spark 
somewhat arbitrary?
   
   
   > I don't understand why this isn't the same as minimum number of executors 
as that seems more in line - saying you need some minimum number for this 
application to run and by the way its ok to keep running with this is launching 
new executors is failing.
   
   The minimum number of executors can be 0.
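   
   For reference, the workaround quoted above would look roughly like this 
(hedged sketch, assuming the `spark.executor.maxNumFailures` name under 
discussion):
   
   ```scala
   import org.apache.spark.SparkConf
   
   // Setting the threshold to Int.MaxValue effectively disables the
   // executor failure tracker's fast-fail behavior.
   val conf = new SparkConf()
     .set("spark.executor.maxNumFailures", Int.MaxValue.toString)
   ```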
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45931][PYTHON][DOCS] Refine docstring of mapInPandas [spark]

2023-11-15 Thread via GitHub


HyukjinKwon closed pull request #43811: [SPARK-45931][PYTHON][DOCS] Refine 
docstring of mapInPandas
URL: https://github.com/apache/spark/pull/43811


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45931][PYTHON][DOCS] Refine docstring of mapInPandas [spark]

2023-11-15 Thread via GitHub


HyukjinKwon commented on PR #43811:
URL: https://github.com/apache/spark/pull/43811#issuecomment-1813617078

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45936][PS] Optimize `Index.symmetric_difference` [spark]

2023-11-15 Thread via GitHub


HyukjinKwon closed pull request #43816: [SPARK-45936][PS] Optimize 
`Index.symmetric_difference`
URL: https://github.com/apache/spark/pull/43816


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45936][PS] Optimize `Index.symmetric_difference` [spark]

2023-11-15 Thread via GitHub


HyukjinKwon commented on PR #43816:
URL: https://github.com/apache/spark/pull/43816#issuecomment-1813613763

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error [spark]

2023-11-15 Thread via GitHub


panbingkun commented on code in PR #43815:
URL: https://github.com/apache/spark/pull/43815#discussion_r1395047884


##
python/docs/source/conf.py:
##
@@ -102,9 +102,9 @@
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
-.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html
+.. _downloading: https://spark.apache.org/docs/{1}/#downloading
 .. |building_spark| replace:: Building Spark
-.. _building_spark: https://spark.apache.org/docs/{1}/#downloading
+.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html

Review Comment:
   Yes, I have checked the branches: branch-3.3, branch-3.4, and branch-3.5, 
which have all been affected. Therefore, I have added: 3.5.0, 3.4.1, 3.3.3. If 
there are conflicts during the merge process, please let me know and I will 
resubmit them on each branch. Thank you very much for your reminder.
   (screenshot: https://github.com/apache/spark/assets/15246973/00795396-3aef-47af-ab39-daff66686228)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Fix variant parquet reader. [spark]

2023-11-15 Thread via GitHub


chenhao-db commented on code in PR #43825:
URL: https://github.com/apache/spark/pull/43825#discussion_r1395045528


##
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##
@@ -73,5 +73,12 @@ class VariantSuite extends QueryTest with SharedSparkSession 
{
   values.map(v => if (v == null) "null" else v.debugString()).sorted
 }
 assert(prepareAnswer(input) == prepareAnswer(result))
+
+withTempDir { dir =>

Review Comment:
   Because the variant values it writes are all non-null. This only causes an 
issue when there is a null variant value.
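   
   A minimal repro along those lines might look like this (hedged sketch: it 
assumes the `parse_json` function from the variant work and the `withTempDir` 
helper used in the quoted suite; the expression is illustrative):
   
   ```scala
   // Odd ids produce a NULL variant, which is what exercises the
   // def/rep-level assembly path this fix touches.
   withTempDir { dir =>
     val path = dir.getAbsolutePath
     spark.sql(
       "select case when id % 2 = 0 then parse_json(cast(id as string)) end as v " +
       "from range(4)")
       .write.mode("overwrite").parquet(path)
     spark.read.parquet(path).collect() // would fail without the fix, per this thread
   }
   ```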



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Fix variant parquet reader. [spark]

2023-11-15 Thread via GitHub


cloud-fan commented on code in PR #43825:
URL: https://github.com/apache/spark/pull/43825#discussion_r1395044163


##
sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala:
##
@@ -73,5 +73,12 @@ class VariantSuite extends QueryTest with SharedSparkSession 
{
   values.map(v => if (v == null) "null" else v.debugString()).sorted
 }
 assert(prepareAnswer(input) == prepareAnswer(result))
+
+withTempDir { dir =>

Review Comment:
   The `basic tests` test case also tests parquet write and read, so why didn't 
it expose the bug?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]

2023-11-15 Thread via GitHub


panbingkun commented on PR #37588:
URL: https://github.com/apache/spark/pull/37588#issuecomment-1813555489

   @cloud-fan If you have time, could you please take a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45827] Fix variant parquet reader. [spark]

2023-11-15 Thread via GitHub


chenhao-db commented on PR #43825:
URL: https://github.com/apache/spark/pull/43825#issuecomment-1813534407

   @cloud-fan @HyukjinKwon could you help take a look? Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-44699][CORE] Add log when finished write events to file in EventLogFileWriter.closeWriter [spark]

2023-11-15 Thread via GitHub


github-actions[bot] commented on PR #42372:
URL: https://github.com/apache/spark/pull/42372#issuecomment-1813504639

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [Spark Ticket][WIP]Added a warning to pop up in the case the user doesn't use gpus [spark]

2023-11-15 Thread via GitHub


github-actions[bot] commented on PR #42308:
URL: https://github.com/apache/spark/pull/42308#issuecomment-1813504691

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-44685][SQL] Remove deprecated Catalog#createExternalTable [spark]

2023-11-15 Thread via GitHub


github-actions[bot] commented on PR #42356:
URL: https://github.com/apache/spark/pull/42356#issuecomment-1813504669

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45525][SQL][PYTHON] Initial support for Python data source write [spark]

2023-11-15 Thread via GitHub


allisonwang-db commented on PR #43791:
URL: https://github.com/apache/spark/pull/43791#issuecomment-1813486370

   @cloud-fan @HyukjinKwon @ueshin This PR is ready for review. It focuses on 
the optimizer/execution part of data source write and is independent of the 
DataFrameWriter.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45592][SPARK-45282][SQL] Correctness issue in AQE with InMemoryTableScanExec [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43760:
URL: https://github.com/apache/spark/pull/43760#issuecomment-1813474266

   For the record, I landed at branch-3.4 after resolving conflicts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45935][PYTHON][DOCS] Fix RST files link substitutions error [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43815:
URL: https://github.com/apache/spark/pull/43815#discussion_r1394963625


##
python/docs/source/conf.py:
##
@@ -102,9 +102,9 @@
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
-.. _downloading: https://spark.apache.org/docs/{1}/building-spark.html
+.. _downloading: https://spark.apache.org/docs/{1}/#downloading
 .. |building_spark| replace:: Building Spark
-.. _building_spark: https://spark.apache.org/docs/{1}/#downloading
+.. _building_spark: https://spark.apache.org/docs/{1}/building-spark.html

Review Comment:
   If this happens in Apache Spark 3.5.0, could you add `3.5.0` to the affected 
version, @panbingkun ?
   
   (screenshot: https://github.com/apache/spark/assets/9700541/d3966d02-8572-4a96-b56a-a4bf729e65f9)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45930][SQL] Support non-deterministic UDFs in MapInPandas/MapInArrow [spark]

2023-11-15 Thread via GitHub


allisonwang-db commented on PR #43810:
URL: https://github.com/apache/spark/pull/43810#issuecomment-1813393808

   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45827] Fix variant parquet reader. [spark]

2023-11-15 Thread via GitHub


chenhao-db opened a new pull request, #43825:
URL: https://github.com/apache/spark/pull/43825

   ## What changes were proposed in this pull request?
   
   This is a follow-up of https://github.com/apache/spark/pull/43707. The 
previous PR missed a piece in the variant parquet reader: we are treating the 
variant type as `struct`, so it also needs a 
similar `assembleStruct` process in the Parquet reader to correctly set the 
nullness of variant values from def/rep levels.
   
   ## How was this patch tested?
   
   Extend the existing unit test. It would fail without the change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813372467

   I also cherry-picked this to branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-44488][SQL] Support deserializing long types when creating `Metadata` object from JObject [spark]

2023-11-15 Thread via GitHub


scottsand-db commented on PR #42083:
URL: https://github.com/apache/spark/pull/42083#issuecomment-1813363910

   Will this make Apache Spark 3.6 release? Or 4.0?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813346533

   Also, thank you, @yaooqinn and @bjornjorgensen , too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun closed pull request #43814: [SPARK-45934][DOCS] Fix `Spark 
Standalone` documentation table layout
URL: https://github.com/apache/spark/pull/43814


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813345234

   Thank you so much, @huaxingao . Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


huaxingao commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813344307

   LGTM Thanks @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45719][K8S][TESTS] Upgrade AWS SDK to v2 for Kubernetes IT [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43510:
URL: https://github.com/apache/spark/pull/43510#issuecomment-1813343142

   Welcome to the Apache Spark community, @junyuc25 !
   I added you to the Apache Spark contributor group and assigned SPARK-45719 
to you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45719][K8S][TESTS] Upgrade AWS SDK to v2 for Kubernetes IT [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun closed pull request #43510: [SPARK-45719][K8S][TESTS] Upgrade AWS 
SDK to v2 for Kubernetes IT
URL: https://github.com/apache/spark/pull/43510


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when 

2023-11-15 Thread via GitHub


ahshahid commented on PR #43824:
URL: https://github.com/apache/spark/pull/43824#issuecomment-1813334777

   I will add the documentation to the new methods in the next commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45866][SQL] Fix for Reuse of Exchange in AQE not happening when DPP filters are pushed down to the underlying Scan (like iceberg) [spark]

2023-11-15 Thread via GitHub


ahshahid opened a new pull request, #43824:
URL: https://github.com/apache/spark/pull/43824

   ### What changes were proposed in this pull request?
   The main change in this PR is to augment the trait 
`SupportsRuntimeV2Filtering` by adding two new methods:
   
       default boolean equalToIgnoreRuntimeFilters(Scan other) {
         return this.equals(other);
       }
   
       default int hashCodeIgnoreRuntimeFilters() {
         return this.hashCode();
       }
   
   which the underlying V2 Scan should implement.
   BatchScanExec also gets modified accordingly: it invokes these two 
methods to check the equality of the Scan.
   
   Please note that this PR includes the code of 2 other PRs too:
   1) [SPARK-45658](https://github.com/apache/spark/pull/43737)
   This PR is not required per se, but is good to have for correctness (and 
my other PR for broadcast-var pushdown relies on this fix).
   2) [SPARK-45926](https://github.com/apache/spark/pull/43808)
   This PR is necessary to reproduce the issue, and hence its code is needed for 
this PR to show the issue.
   
   **Also, for this test to pass, the code of DataSourceV2Relation.computeStats 
should disable throwing an assertion error in testing, as that is a separate 
bug which gets hit when the bug test for this PR is run.**
   
   ### Why are the changes needed?
   This change is needed IMO to fix the issue of reuse of exchange not happening 
when DPP filters are pushed to the scan level.
   The issue is this:
   In certain types of queries, e.g. TPCDS Query 14b, the reuse of exchange 
does not happen in AQE, resulting in perf degradation.
   The Spark TPCDS tests are unable to catch the problem, because the 
InMemoryScans used for testing do not implement equals & hashCode correctly, 
in the sense that they do not take into account the pushed-down runtime 
filters.
   
   In concrete Scan implementations, e.g. iceberg's SparkBatchQueryScan, the 
equality check, apart from other things, also involves the pushed runtime 
filters (which is correct).
   
   Below is a description of how this issue surfaces.
   For a given stage being materialized, just before materialization starts, 
the runtime filters are confined to the BatchScanExec level.
   Only when the actual RDD corresponding to the BatchScanExec is being 
evaluated do the runtime filters get pushed to the underlying Scan.
   
   Now if a new stage is created and it checks the stageCache using its 
canonicalized plan to see if a stage can be reused, it fails to find the 
reusable stage even if the stage exists, because the canonicalized spark plan 
present in the stage cache now has the runtime filters pushed to the Scan, 
so the incoming canonicalized spark plan does not match the key, as their 
underlying scans differ: the incoming spark plan's scan does not have runtime 
filters, while the canonicalized spark plan present as the key in the stage 
cache has the scan with runtime filters pushed.
   
   The fix, as I have implemented it, is to provide two methods in the 
SupportsRuntimeV2Filtering interface:
   
       default boolean equalToIgnoreRuntimeFilters(Scan other) {
         return this.equals(other);
       }
   
       default int hashCodeIgnoreRuntimeFilters() {
         return this.hashCode();
       }
   
   In BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, 
then instead of batch.equals it should call scan.equalToIgnoreRuntimeFilters.
   
   And the underlying Scan implementations should provide equality which 
excludes runtime filters.
   
   Similarly, the hashCode of BatchScanExec should use 
scan.hashCodeIgnoreRuntimeFilters instead of batch.hashCode.
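   
   As a hedged illustration of how a physical node would consume these methods 
(a simplified stand-in, not the actual BatchScanExec code, and assuming the two 
methods proposed above are added to SupportsRuntimeV2Filtering):
   
   ```scala
   import org.apache.spark.sql.connector.read.{Scan, SupportsRuntimeV2Filtering}
   
   // Simplified stand-in for the equality logic described above.
   case class ScanNodeSketch(scan: Scan) {
     override def equals(other: Any): Boolean = other match {
       case o: ScanNodeSketch => scan match {
         // Ignore pushed runtime filters so canonicalized plans keep matching.
         case s: SupportsRuntimeV2Filtering => s.equalToIgnoreRuntimeFilters(o.scan)
         case _ => scan == o.scan
       }
       case _ => false
     }
   
     override def hashCode(): Int = scan match {
       case s: SupportsRuntimeV2Filtering => s.hashCodeIgnoreRuntimeFilters()
       case _ => scan.hashCode()
     }
   }
   ```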
   
   ### Does this PR introduce _any_ user-facing change?
   No. But the respective DataSourceV2Relations may need to augment their code.
   
   ### How was this patch tested?
   Added bug test for the same.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-15 Thread via GitHub


abellina commented on code in PR #43627:
URL: https://github.com/apache/spark/pull/43627#discussion_r1394870890


##
core/src/main/scala/org/apache/spark/SparkEnv.scala:
##
@@ -415,6 +418,11 @@ object SparkEnv extends Logging {
 advertiseAddress, blockManagerPort, numUsableCores, 
blockManagerMaster.driverEndpoint)
 
 // NB: blockManager is not valid until initialize() is called later.
+// SPARK-45762 introduces a change where the ShuffleManager is 
initialized later
+// in the SparkContext and Executor, to allow for custom 
ShuffleManagers defined
+// in user jars. In the executor, the BlockManager uses a lazy val to 
obtain the
+// shuffleManager from the SparkEnv. In the driver, the SparkEnv's 
shuffleManager

Review Comment:
   Thanks @tgravescs. Handled both comments here: 
https://github.com/apache/spark/pull/43627/commits/6d002a361ac2c1dfad48ee530766c9b0a605696f
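   
   For context, a minimal sketch of the lazy-initialization pattern being 
discussed (illustrative names; the real wiring lives in SparkEnv, SparkContext, 
and Executor):
   
   ```scala
   // The ShuffleManager is injected after env creation, once user jars are
   // on the classpath; consumers resolve it lazily on first use.
   class EnvSketch {
     private var _shuffleManager: AnyRef = _
     def initializeShuffleManager(sm: AnyRef): Unit = { _shuffleManager = sm }
     def shuffleManager: AnyRef = _shuffleManager
   }
   
   class BlockManagerSketch(env: EnvSketch) {
     // lazy val: first access happens only after initializeShuffleManager() ran
     private lazy val shuffleManager = env.shuffleManager
   }
   ```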



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813319975

   Could you review this `Spark Standalone` documentation PR when you have some 
time, @huaxingao ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45856] Move ArtifactManager from Spark Connect into SparkSession (sql/core) [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43735:
URL: https://github.com/apache/spark/pull/43735#discussion_r1394847206


##
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##
@@ -243,6 +244,16 @@ class SparkSession private(
   @Unstable
   def streams: StreamingQueryManager = sessionState.streamingQueryManager
 
+  /**
+   * Returns an `ArtifactManager` that supports adding, managing and using 
session-scoped artifacts
+   * (jars, classfiles, etc).
+   *
+   * @since 3.5.1

Review Comment:
   This should be 4.0.0 because this PR is for Apache Spark 4.0.0, @vicennial .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]

2023-11-15 Thread via GitHub


tgravescs commented on code in PR #43627:
URL: https://github.com/apache/spark/pull/43627#discussion_r1394837773


##
core/src/main/scala/org/apache/spark/SparkEnv.scala:
##
@@ -415,6 +418,11 @@ object SparkEnv extends Logging {
 advertiseAddress, blockManagerPort, numUsableCores, 
blockManagerMaster.driverEndpoint)
 
 // NB: blockManager is not valid until initialize() is called later.
+// SPARK-45762 introduces a change where the ShuffleManager is 
initialized later
+// in the SparkContext and Executor, to allow for custom 
ShuffleManagers defined
+// in user jars. In the executor, the BlockManager uses a lazy val to 
obtain the
+// shuffleManager from the SparkEnv. In the driver, the SparkEnv's 
shuffleManager

Review Comment:
   I think this comment is no longer true. The driver SparkEnv's shuffleManager 
is created after the plugin is initialized.



##
core/src/main/scala/org/apache/spark/SparkEnv.scala:
##
@@ -71,6 +70,12 @@ class SparkEnv (
 val outputCommitCoordinator: OutputCommitCoordinator,
 val conf: SparkConf) extends Logging {
 
+  // We initialize the ShuffleManager later in SparkContext, and Executor, to 
allow

Review Comment:
   ```suggestion
 // We initialize the ShuffleManager later in SparkContext and Executor to 
allow
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-15 Thread via GitHub


ueshin commented on PR #43682:
URL: https://github.com/apache/spark/pull/43682#issuecomment-1813313871

   Thanks! merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45810][Python] Create Python UDTF API to stop consuming rows from the input table [spark]

2023-11-15 Thread via GitHub


ueshin closed pull request #43682: [SPARK-45810][Python] Create Python UDTF API 
to stop consuming rows from the input table
URL: https://github.com/apache/spark/pull/43682


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45868][CONNECT] Make sure `spark.table` use the same parser with vanilla spark [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43741:
URL: https://github.com/apache/spark/pull/43741#issuecomment-1813308124

   Merged to master. Thank you, @zhengruifeng and all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45868][CONNECT] Make sure `spark.table` use the same parser with vanilla spark [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun closed pull request #43741: [SPARK-45868][CONNECT] Make sure 
`spark.table` use the same parser with vanilla spark
URL: https://github.com/apache/spark/pull/43741


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813302230

   > Thank you for fixing the documentation for K8S and Standalone :)
   
   Thanks, but I'm going to proceed with the K8s part in a new JIRA because of 
the previous comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45941][PS] Upgrade `pandas` to version 2.1.3 [spark]

2023-11-15 Thread via GitHub


bjornjorgensen commented on PR #43822:
URL: https://github.com/apache/spark/pull/43822#issuecomment-1813301006

   Thank you @dongjoon-hyun 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


bjornjorgensen commented on PR #43814:
URL: https://github.com/apache/spark/pull/43814#issuecomment-1813297066

   Thank you for fixing the documentation for K8S and Standalone :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


bjornjorgensen commented on code in PR #43814:
URL: https://github.com/apache/spark/pull/43814#discussion_r1394833447


##
docs/running-on-kubernetes.md:
##
@@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for 
information on Spark config
   3.0.0
 
 
-  memoryOverheadFactor
+  spark.kubernetes.memoryOverheadFactor
   0.1
   
-This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
local.dirs.tmpfs is true. For JVM-based jobs this 
value will default to 0.10 and 0.40 for non-JVM jobs.
+This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
spark.kubernetes.local.dirs.tmpfs is true. For 
JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
 This is done as non-JVM tasks need more non-JVM heap space and such tasks 
commonly fail with "Memory Overhead Exceeded" errors. This preempts this error 
with a higher default.
 This will be overridden by the value set by 
spark.driver.memoryOverheadFactor and 
spark.executor.memoryOverheadFactor explicitly.

Review Comment:
   yes, I did read the K8s part. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43814:
URL: https://github.com/apache/spark/pull/43814#discussion_r1394832622


##
docs/running-on-kubernetes.md:
##
@@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for 
information on Spark config
   3.0.0
 
 
-  memoryOverheadFactor
+  spark.kubernetes.memoryOverheadFactor
   0.1
   
-This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
local.dirs.tmpfs is true. For JVM-based jobs this 
value will default to 0.10 and 0.40 for non-JVM jobs.
+This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
spark.kubernetes.local.dirs.tmpfs is true. For 
JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
 This is done as non-JVM tasks need more non-JVM heap space and such tasks 
commonly fail with "Memory Overhead Exceeded" errors. This preempts this error 
with a higher default.
 This will be overridden by the value set by 
spark.driver.memoryOverheadFactor and 
spark.executor.memoryOverheadFactor explicitly.

Review Comment:
   Here.
   ```
   $ git diff HEAD~2 --stat
docs/spark-standalone.md | 10 ++
1 file changed, 6 insertions(+), 4 deletions(-)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45925][SQL] Making SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec [spark]

2023-11-15 Thread via GitHub


ahshahid commented on PR #43807:
URL: https://github.com/apache/spark/pull/43807#issuecomment-1813289805

   @beliefer I think you may be right. In another PR of mine, for broadcast-var 
pushdown, I am seeing an unmodified SubqueryAdaptiveBroadcastExec in the stage 
cache's keys. Maybe it is an issue in my code or something else; I will check 
my code again for this. So as of now, I think it makes sense to close this PR 
and also the other PR on SubqueryBroadcastExec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45925][SQL] Making SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec [spark]

2023-11-15 Thread via GitHub


ahshahid closed pull request #43807: [SPARK-45925][SQL] Making 
SubqueryBroadcastExec equivalent to SubqueryAdaptiveBroadcastExec
URL: https://github.com/apache/spark/pull/43807


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]

2023-11-15 Thread via GitHub


ahshahid closed pull request #43806: [SPARK-45924][SQL] Fixing the 
canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with 
SubqueryBroadcastExec
URL: https://github.com/apache/spark/pull/43806


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[PR] [SPARK-45942][Core] Only do the thread interruption check for putIterator on executors [spark]

2023-11-15 Thread via GitHub


huanliwang-db opened a new pull request, #43823:
URL: https://github.com/apache/spark/pull/43823

   
   
   ### What changes were proposed in this pull request?
   Only do the thread interruption check for putIterator on executors
   
   
   
   ### Why are the changes needed?
   
   
   https://issues.apache.org/jira/browse/SPARK-45025 
   
   introduces graceful thread-interruption handling. However, there is an 
edge case: when a streaming query is stopped on the driver, it interrupts the 
stream execution thread. If the streaming query is performing memory store 
operations on the driver and calls doPutIterator at the same time, the [unroll 
process will be 
broken](https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L224)
 and [return the used 
memory](https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L245-L247).
   
   This can result in a closeChannelException, as it falls into this [case 
clause](https://github.com/apache/spark/blob/aa646d3050028272f7333deaef52f20e6975e0ed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1614-L1622)
 which opens an I/O channel and persists the data to disk. However, because 
the thread is interrupted, the channel is closed at the start 
(https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/nio/channels/spi/AbstractInterruptibleChannel.java#L172)
 and a closeChannelException is thrown.
   
   On executors, [the task will be killed if the thread is 
interrupted](https://github.com/apache/spark/blob/39fc6108bfaaa0ce471f6460880109f948ba5c62/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L374);
 however, we do not do this on the driver.
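
   Below is a minimal sketch of the idea in this change, in plain Scala: treat 
thread interruption as fatal during unrolling only when running on an 
executor. The names (`unroll`, `onExecutor`) are illustrative assumptions, not 
the actual MemoryStore code.

   ```scala
   // Sketch only: interruption aborts the unroll on executors, but the
   // driver keeps unrolling so a stopped streaming query does not hit a
   // closed-channel error while spilling the partially unrolled block.
   def unroll[T](values: Iterator[T], onExecutor: Boolean): Either[Long, Seq[T]] = {
     val unrolled = scala.collection.mutable.ArrayBuffer.empty[T]
     while (values.hasNext) {
       if (onExecutor && Thread.currentThread().isInterrupted) {
         return Left(unrolled.size.toLong) // report what was unrolled so far
       }
       unrolled += values.next()
     }
     Right(unrolled.toSeq)
   }
   ```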
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   
   ### How was this patch tested?
   
   Ran MemoryStoreSuite
   ```
   [info] MemoryStoreSuite:
   [info] - reserve/release unroll memory (36 milliseconds)
   [info] - safely unroll blocks (70 milliseconds)
   [info] - safely unroll blocks through putIteratorAsValues (10 milliseconds)
   [info] - safely unroll blocks through putIteratorAsValues off-heap (21 
milliseconds)
   [info] - safely unroll blocks through putIteratorAsBytes (138 milliseconds)
   [info] - PartiallySerializedBlock.valuesIterator (6 milliseconds)
   [info] - PartiallySerializedBlock.finishWritingToStream (5 milliseconds)
   [info] - multiple unrolls by the same thread (8 milliseconds)
   [info] - lazily create a big ByteBuffer to avoid OOM if it cannot be put 
into MemoryStore (3 milliseconds)
   [info] - put a small ByteBuffer to MemoryStore (3 milliseconds)
   [info] - SPARK-22083: Release all locks in evictBlocksToFreeSpace (43 
milliseconds)
   [info] - put user-defined objects to MemoryStore and remove (5 milliseconds)
   [info] - put user-defined objects to MemoryStore and clear (4 milliseconds)
   [info] Run completed in 1 second, 587 milliseconds.
   [info] Total number of tests run: 13
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45924][SQL] Fixing the canonicalization of SubqueryAdaptiveBroadcastExec and making it equivalent with SubqueryBroadcastExec [spark]

2023-11-15 Thread via GitHub


ahshahid commented on PR #43806:
URL: https://github.com/apache/spark/pull/43806#issuecomment-1813288994

   @beliefer I think you may be right. In another PR of mine, for 
broadcast-var-pushdown, I am seeing an unmodified SubqueryAdaptiveBroadcastExec 
in the stage cache's keys. It may be an issue in my code or something else; I 
will check my code again. So as of now, I think it makes sense to close this 
PR and also the other PR for SubqueryBroadcastHashExec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45527][CORE] Use fraction to do the resource calculation [spark]

2023-11-15 Thread via GitHub


tgravescs commented on code in PR #43494:
URL: https://github.com/apache/spark/pull/43494#discussion_r1384051957


##
core/src/main/scala/org/apache/spark/resource/ResourceUtils.scala:
##
@@ -170,16 +170,16 @@ private[spark] object ResourceUtils extends Logging {
   // integer amount and the number of slots per address. For instance, if the 
amount is 0.5,
   // the we get (1, 2) back out. This indicates that for each 1 address, it 
has 2 slots per
   // address, which allows you to put 2 tasks on that address. Note if amount 
is greater
-  // than 1, then the number of slots per address has to be 1. This would 
indicate that a
+  // than 1, then the number of parts per address has to be 1. This would 
indicate that a
   // would have multiple addresses assigned per task. This can be used for 
calculating
   // the number of tasks per executor -> (executorAmount * numParts) / 
(integer amount).
   // Returns tuple of (integer amount, numParts)
   def calculateAmountAndPartsForFraction(doubleAmount: Double): (Int, Int) = {
-val parts = if (doubleAmount <= 0.5) {
+val parts = if (doubleAmount <= 1.0) {

Review Comment:
   did you move this check somewhere else?
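
   For context, here is a small self-contained sketch of the fraction logic 
described in the quoted comment, so the (integer amount, numParts) contract is 
easier to see. It is illustrative only, not the actual Spark source; in 
particular, rounding a fractional amount above 1 up with ceil is an assumption 
here.

   ```scala
   // e.g. 0.5 -> (1, 2): one address, two task slots ("parts") per address.
   // Amounts above 1 mean whole addresses per task, so parts must be 1.
   def calculateAmountAndPartsForFraction(doubleAmount: Double): (Int, Int) = {
     require(doubleAmount > 0.0, s"resource amount must be > 0: $doubleAmount")
     if (doubleAmount <= 1.0) {
       (1, math.floor(1.0 / doubleAmount).toInt)
     } else {
       (math.ceil(doubleAmount).toInt, 1)
     }
   }
   ```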



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43814:
URL: https://github.com/apache/spark/pull/43814#discussion_r1394826363


##
docs/running-on-kubernetes.md:
##
@@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for 
information on Spark config
   3.0.0
 
 
-  memoryOverheadFactor
+  spark.kubernetes.memoryOverheadFactor
   0.1
   
-This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
local.dirs.tmpfs is true. For JVM-based jobs this 
value will default to 0.10 and 0.40 for non-JVM jobs.
+This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
spark.kubernetes.local.dirs.tmpfs is true. For 
JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
 This is done as non-JVM tasks need more non-JVM heap space and such tasks 
commonly fail with "Memory Overhead Exceeded" errors. This preempts this error 
with a higher default.
 This will be overridden by the value set by 
spark.driver.memoryOverheadFactor and 
spark.executor.memoryOverheadFactor explicitly.

Review Comment:
   It seems that you are looking at the first commit. I removed the K8s part 
from this PR completely in the latest commit.
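
   For readers of the quoted doc text above: the override precedence it 
describes can be exercised as below. This is a hedged, illustrative example 
(the factor values and local master are arbitrary), not taken from the PR:

   ```scala
   import org.apache.spark.sql.SparkSession

   // spark.kubernetes.memoryOverheadFactor is overridden when the
   // per-role factors are set explicitly.
   val spark = SparkSession.builder()
     .master("local[*]") // local master only so the snippet runs standalone
     .config("spark.kubernetes.memoryOverheadFactor", "0.2") // superseded below
     .config("spark.driver.memoryOverheadFactor", "0.1")
     .config("spark.executor.memoryOverheadFactor", "0.4")
     .getOrCreate()
   ```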



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-45934][DOCS] Fix `Spark Standalone` documentation table layout [spark]

2023-11-15 Thread via GitHub


dongjoon-hyun commented on code in PR #43814:
URL: https://github.com/apache/spark/pull/43814#discussion_r1394824967


##
docs/running-on-kubernetes.md:
##
@@ -1203,17 +1203,17 @@ See the [configuration page](configuration.html) for 
information on Spark config
   3.0.0
 
 
-  memoryOverheadFactor
+  spark.kubernetes.memoryOverheadFactor
   0.1
   
-This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
local.dirs.tmpfs is true. For JVM-based jobs this 
value will default to 0.10 and 0.40 for non-JVM jobs.
+This sets the Memory Overhead Factor that will allocate memory to non-JVM 
memory, which includes off-heap memory allocations, non-JVM tasks, various 
systems processes, and tmpfs-based local directories when 
spark.kubernetes.local.dirs.tmpfs is true. For 
JVM-based jobs this value will default to 0.10 and 0.40 for non-JVM jobs.
 This is done as non-JVM tasks need more non-JVM heap space and such tasks 
commonly fail with "Memory Overhead Exceeded" errors. This preempts this error 
with a higher default.
 This will be overridden by the value set by 
spark.driver.memoryOverheadFactor and 
spark.executor.memoryOverheadFactor explicitly.

Review Comment:
   This is only for the `Spark Standalone` documentation, @bjornjorgensen 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


