Re: [PR] [SPARK-45777][CORE] Support `spark.test.appId` in `LocalSchedulerBackend` [spark]
dongjoon-hyun commented on code in PR #43645: URL: https://github.com/apache/spark/pull/43645#discussion_r1381186064

## core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala:

```diff
@@ -110,7 +110,7 @@ private[spark] class LocalSchedulerBackend(
     val totalCores: Int)
   extends SchedulerBackend with ExecutorBackend with Logging {

-  private val appId = "local-" + System.currentTimeMillis
+  private val appId = sys.props.getOrElse("spark.test.appId", "local-" + System.currentTimeMillis)
```

Review Comment: Although I did this to avoid the `SparkConf(false)` case, in my use case I always use environment variables, so it would be okay to load it from `SparkConf`. Do you want me to change it, @yaooqinn ?
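For illustration, a minimal sketch of the `SparkConf`-based alternative being discussed (hypothetical code, not the PR's change; it relies on `loadDefaults = true` copying `spark.*` JVM system properties into the conf):

```scala
import org.apache.spark.SparkConf

// Sketch: resolve the test app ID through SparkConf rather than sys.props.
// A SparkConf built with loadDefaults = true picks up every spark.* JVM
// system property, so setting -Dspark.test.appId=... keeps working.
val conf = new SparkConf(loadDefaults = true)
val appId = conf
  .getOption("spark.test.appId")
  .getOrElse("local-" + System.currentTimeMillis)
```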
Re: [PR] [MINOR][DOCS] Fixed typo [spark]
yaooqinn commented on code in PR #43634: URL: https://github.com/apache/spark/pull/43634#discussion_r1381181952

## sql/README.md:

```diff
@@ -6,7 +6,7 @@ This module provides support for executing relational queries expressed in eithe
 Spark SQL is broken up into four subprojects:
  - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- - Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
+ - Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveSQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
```

Review Comment: From what I have gathered, HiveQL appears to be a widely accepted term, e.g. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.ProcessingHiveQL.html
Re: [PR] [MINOR][DOCS] Fixed typo [spark]
yaooqinn commented on PR #43634: URL: https://github.com/apache/spark/pull/43634#issuecomment-1791921276

Hi @YuanHanzhong, thanks for your first contribution to Spark. However, could you please include #43636 in this patch as well? Also, kindly answer the questions in the PR template instead of removing them all.
Re: [PR] [SPARK-45740][SQL] Relax the node prefix of SparkPlanGraphCluster [spark]
yaooqinn commented on PR #43602: URL: https://github.com/apache/spark/pull/43602#issuecomment-1791914386

Can you attach a screenshot of the UI after this change?
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
mridulm commented on code in PR #43627: URL: https://github.com/apache/spark/pull/43627#discussion_r1381167046

## core/src/main/scala/org/apache/spark/SparkEnv.scala:

```diff
@@ -71,6 +69,10 @@ class SparkEnv (
     val outputCommitCoordinator: OutputCommitCoordinator,
     val conf: SparkConf) extends Logging {

+  // We initialize the ShuffleManager later, in SparkContext and Executor, to allow
+  // user jars to define custom ShuffleManagers.
+  var shuffleManager: ShuffleManager = _
```

Review Comment: Given `SparkEnv` is a `DeveloperApi`, let us not expose this for mutation.

```suggestion
  private var _shuffleManager: ShuffleManager = _

  def shuffleManager: ShuffleManager = _shuffleManager
```
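For context, a minimal sketch of the resulting accessor pattern (the `private[spark]` setter here is an assumption for illustration, not part of the suggestion):

```scala
package org.apache.spark

import org.apache.spark.shuffle.ShuffleManager

class SparkEnvSketch {
  // Mutable only inside the spark package; external DeveloperApi consumers
  // see just the read-only accessor below.
  private var _shuffleManager: ShuffleManager = _

  def shuffleManager: ShuffleManager = _shuffleManager

  // Hypothetical setter used by SparkContext / Executor once user jars
  // (and any custom ShuffleManager they define) can be loaded.
  private[spark] def initShuffleManager(sm: ShuffleManager): Unit =
    _shuffleManager = sm
}
```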
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
mridulm closed pull request #43596: [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky URL: https://github.com/apache/spark/pull/43596
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
mridulm commented on PR #43596: URL: https://github.com/apache/spark/pull/43596#issuecomment-1791900605

Merged to master. Thanks for fixing this @hasnain-db !
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
mridulm commented on code in PR #43596: URL: https://github.com/apache/spark/pull/43596#discussion_r1381165616

## common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:

```diff
@@ -161,14 +161,17 @@ public void testReload() throws Exception {
     // At this point we haven't reloaded, just the initial load
     assertEquals(0, tm.reloadCount);

+    // Wait so that the file modification time is different
+    Thread.sleep((tm.getReloadInterval() + 200));
+
     // Add another cert
     Map certs = new HashMap();
     certs.put("cert1", cert1);
     certs.put("cert2", cert2);
     createTrustStore(trustStore, "password", certs);

-    // Wait up to 5s until we reload
-    waitForReloadCount(tm, 1, 50);
+    // Wait up to 10s until we reload
+    waitForReloadCount(tm, 1, 100);
```

Review Comment: That should do it :-) I would expect a lot more tests (outside of this effort) to be impacted on machines that are under load!
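For context, a hedged Scala sketch of the poll-and-retry shape a helper like `waitForReloadCount` typically has (hypothetical code, not the suite's actual Java implementation; it shows why 100 attempts at ~100 ms each gives the 10 s budget):

```scala
// Hypothetical polling helper: check up to `attempts` times with a 100 ms
// pause between checks, so attempts = 100 allows roughly 10 seconds.
def waitForReloadCount(currentCount: () => Int, expected: Int, attempts: Int): Unit = {
  var i = 0
  while (i < attempts && currentCount() < expected) {
    Thread.sleep(100)
    i += 1
  }
  assert(currentCount() >= expected, s"reload count did not reach $expected")
}
```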
Re: [PR] [SPARK-45777][CORE] Support `spark.test.appId` in `LocalSchedulerBackend` [spark]
yaooqinn commented on code in PR #43645: URL: https://github.com/apache/spark/pull/43645#discussion_r1381162154

## core/src/main/scala/org/apache/spark/scheduler/local/LocalSchedulerBackend.scala:

```diff
@@ -110,7 +110,7 @@ private[spark] class LocalSchedulerBackend(
     val totalCores: Int)
   extends SchedulerBackend with ExecutorBackend with Logging {

-  private val appId = "local-" + System.currentTimeMillis
+  private val appId = sys.props.getOrElse("spark.test.appId", "local-" + System.currentTimeMillis)
```

Review Comment: Is there a particular reason to retrieve this from the system properties instead of the SparkConf?
Re: [PR] [SPARK-45774][CORE][UI] Support `spark.ui.historyServerUrl` in `ApplicationPage` [spark]
yaooqinn commented on code in PR #43643: URL: https://github.com/apache/spark/pull/43643#discussion_r1381161116

## core/src/main/scala/org/apache/spark/deploy/master/ui/ApplicationPage.scala:

```diff
@@ -98,6 +98,11 @@ private[ui] class ApplicationPage(parent: MasterWebUI) extends WebUIPage("app")
           Application Detail UI

+      } else if (parent.master.historyServerUrl.nonEmpty) {
+
```

Review Comment: The history server is configured per application, not on the Spark master. So, IMO, we should get the history server URL from the app itself instead.
Re: [PR] [SPARK-45777][CORE] Support `spark.test.appId` in `LocalSchedulerBackend` [spark]
dongjoon-hyun commented on PR #43645: URL: https://github.com/apache/spark/pull/43645#issuecomment-1791879563

When you have some time, could you review this test conf PR, too, @LuciferYang ?
Re: [PR] [SPARK-45527][core] Use fraction to do the resource calculation [spark]
wbo4958 commented on code in PR #43494: URL: https://github.com/apache/spark/pull/43494#discussion_r1379714430

## core/src/main/scala/org/apache/spark/resource/ResourceAllocator.scala:

```diff
@@ -29,59 +65,54 @@ private[spark] trait ResourceAllocator {
   protected def resourceName: String
   protected def resourceAddresses: Seq[String]
-  protected def slotsPerAddress: Int

   /**
-   * Map from an address to its availability, a value > 0 means the address is available,
-   * while value of 0 means the address is fully assigned.
-   *
-   * For task resources ([[org.apache.spark.scheduler.ExecutorResourceInfo]]), this value
-   * can be a multiple, such that each address can be allocated up to [[slotsPerAddress]]
-   * times.
+   * Map from an address to its availability default to RESOURCE_TOTAL_AMOUNT, a value > 0 means
+   * the address is available, while value of 0 means the address is fully assigned.
    */
   private lazy val addressAvailabilityMap = {
-    mutable.HashMap(resourceAddresses.map(_ -> slotsPerAddress): _*)
+    mutable.HashMap(resourceAddresses.map(address => address -> RESOURCE_TOTAL_AMOUNT): _*)
   }

   /**
-   * Sequence of currently available resource addresses.
-   *
-   * With [[slotsPerAddress]] greater than 1, [[availableAddrs]] can contain duplicate addresses
-   * e.g. with [[slotsPerAddress]] == 2, availableAddrs for addresses 0 and 1 can look like
-   * Seq("0", "0", "1"), where address 0 has two assignments available, and 1 has one.
+   * Get the resources and its amounts.
+   * @return the resources amounts
+   */
+  def resourcesAmounts: Map[String, Double] = addressAvailabilityMap.map {
```

Review Comment: Hi Tom, regarding "if we start getting into the requests where user could ask for 25 resources then we could hit overflow issues": I couldn't understand why we would hit the overflow issue.

Review Comment: Yes, leaving it in the Long should be more efficient.
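For illustration, a hedged sketch of the Long-based fixed-point idea being discussed (the scale of 10000 and the helper names are assumptions, not the PR's actual constants):

```scala
// Sketch: track fractional availability as a Long in fixed-point units
// instead of a Double, so repeated additions/subtractions of task fractions
// cannot accumulate floating-point error.
val RESOURCE_TOTAL_AMOUNT: Long = 10000L // assume 1.0 == 10000 internal units

def toInternal(amount: Double): Long = (amount * RESOURCE_TOTAL_AMOUNT).toLong
def toPublic(internal: Long): Double = internal.toDouble / RESOURCE_TOTAL_AMOUNT

// Acquiring a 0.25 slice three times leaves exactly 0.25 available:
var avail: Long = RESOURCE_TOTAL_AMOUNT
(1 to 3).foreach(_ => avail -= toInternal(0.25))
assert(toPublic(avail) == 0.25)
```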
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
ivoson commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791865923

> Could you check again? IIRC, there should be more than 40+ files involved in this issue...

Thanks @LuciferYang ... checking
Re: [PR] [SPARK-45774][CORE][UI] Support `spark.ui.historyServerUrl` in `ApplicationPage` [spark]
dongjoon-hyun commented on PR #43643: URL: https://github.com/apache/spark/pull/43643#issuecomment-1791863587

Could you review this, @LuciferYang ?
[PR] [SPARK-45776][CORE] Remove the defensive null check for `MapOutputTrackerMaster#unregisterShuffle` added in SPARK-39553 [spark]
LuciferYang opened a new pull request, #43644: URL: https://github.com/apache/spark/pull/43644

### What changes were proposed in this pull request?
This PR removes the defensive null check for `MapOutputTrackerMaster#unregisterShuffle` added in SPARK-39553.

### Why are the changes needed?
https://github.com/scala/bug/issues/12613 has been fixed in Scala 2.13.9.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests such as `SPARK-39553: Multi-thread unregister shuffle shouldn't throw NPE` in `MapOutputTrackerSuite`

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
LuciferYang commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381132422

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

```diff
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
       // The data set has 2 partitions, so Spark will write at least 2 json files.
       // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
       // partitions.
-      .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+      .write.partitionBy("p")
+      .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)
```

Review Comment: @pan3793 Could these cases possibly manifest as compilation warnings? Or are they merely suggestions for best practices?
[PR] [SPARK-45774][CORE][UI] Support `spark.ui.historyServerUrl` in `ApplicationPage` [spark]
dongjoon-hyun opened a new pull request, #43643: URL: https://github.com/apache/spark/pull/43643

…

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
LuciferYang commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791845304

Merged into master for Spark 4.0. Thanks @ivoson and @dongjoon-hyun ~
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
LuciferYang closed pull request #43637: [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` URL: https://github.com/apache/spark/pull/43637
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
LuciferYang commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791842647

for example:

https://github.com/apache/spark/blob/eda9911057b893e42f49dbd7448f20f91f2798c4/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L182

https://github.com/apache/spark/blob/eda9911057b893e42f49dbd7448f20f91f2798c4/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1047

and

https://github.com/apache/spark/blob/eda9911057b893e42f49dbd7448f20f91f2798c4/examples/src/main/scala/org/apache/spark/examples/graphx/Analytics.scala#L54
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
LuciferYang commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791840682

Could you check again? IIRC, there should be more than 40+ files involved in this issue...
Re: [PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
ivoson commented on PR #43642: URL: https://github.com/apache/spark/pull/43642#issuecomment-1791831105

cc @LuciferYang please take a look at this PR. Thanks.
[PR] [SPARK-45687][SQL] Fix `Passing an explicit array value to a Scala varargs method is deprecated` [spark]
ivoson opened a new pull request, #43642: URL: https://github.com/apache/spark/pull/43642

### What changes were proposed in this pull request?
Fix the deprecated behavior below:

`Passing an explicit array value to a Scala varargs method is deprecated (since 2.13.0) and will result in a defensive copy; Use the more efficient non-copying ArraySeq.unsafeWrapArray or an explicit toIndexedSeq call`

It exists in two test suites, `AggregationQuerySuite` and `ObjectHashAggregateSuite`; for the cases in these two suites, we can take the non-copying method `ArraySeq.unsafeWrapArray`.

### Why are the changes needed?
Eliminate compile warnings and no longer use deprecated Scala APIs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.
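For illustration, a minimal sketch of the warning and the two fixes named above (generic names, not taken from the test suites):

```scala
import scala.collection.immutable.ArraySeq

object VarargsSketch {
  def printAll(xs: String*): Unit = xs.foreach(println)

  def main(args: Array[String]): Unit = {
    val arr = Array("a", "b", "c")

    // Deprecated since 2.13.0: passing an Array straight to a varargs
    // parameter forces a defensive copy and emits the warning above.
    // printAll(arr: _*)

    // Non-copying fix: wrap the array without copying its elements.
    printAll(ArraySeq.unsafeWrapArray(arr): _*)

    // Copying fix, safer if the array may be mutated afterwards.
    printAll(arr.toIndexedSeq: _*)
  }
}
```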
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
ivoson commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791823766

> https://user-images.githubusercontent.com/1475305/280160599-56b53eae-d7fc-4b21-9424-317c5a3daee5.png
> @ivoson Please update the pr description.

Thanks, @LuciferYang. Done.
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
ivoson commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791823590

> Could you re-trigger the failed test pipeline, @ivoson ?

Hi @dongjoon-hyun, done. The test passed now.
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
beliefer commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381060163

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

```diff
@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
       // The data set has 2 partitions, so Spark will write at least 2 json files.
       // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
       // partitions.
-      .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+      .write.partitionBy("p")
+      .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)
```

Review Comment: Because there are already a lot of Java no-arg methods called with `()`. Let's update them if we really have a strong requirement.
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
beliefer commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381041537

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:

```diff
@@ -21,19 +21,15 @@ import java.util.Locale
 
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._
 
 import org.apache.spark.util.Utils
 
 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className
```

Review Comment: Thank you.
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
beliefer commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1381028768

## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java:

```diff
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.DeflateCodec;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+
+/**
+ * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs.
+ */
+public enum HadoopCompressionCodec {
+  NONE(null),
+  UNCOMPRESSED(null),
+  BZIP2(new BZip2Codec()),
+  DEFLATE(new DeflateCodec()),
+  GZIP(new GzipCodec()),
+  LZ4(new Lz4Codec()),
+  SNAPPY(new SnappyCodec());
+
+  // TODO supports ZStandardCodec
+
+  private final CompressionCodec compressionCodec;
+
+  HadoopCompressionCodec(CompressionCodec compressionCodec) {
+    this.compressionCodec = compressionCodec;
+  }
+
+  public CompressionCodec getCompressionCodec() {
+    return this.compressionCodec;
+  }
+
+  private static final Map<String, String> codecNameMap =
+    Arrays.stream(HadoopCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));
```

Review Comment: I feel like both are OK.
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on PR #43578: URL: https://github.com/apache/spark/pull/43578#issuecomment-1791770240

> let's rebase this one @panbingkun

Done.
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on code in PR #43578: URL: https://github.com/apache/spark/pull/43578#discussion_r1380960587

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeNamespaceExec.scala:

```diff
@@ -46,7 +46,8 @@ case class DescribeNamespaceExec(
     }
 
     if (isExtended) {
-      val properties = metadata.asScala -- CatalogV2Util.NAMESPACE_RESERVED_PROPERTIES
+      val properties = metadata.asScala.filterNot(
```

Review Comment: Good suggestion!
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on code in PR #43578: URL: https://github.com/apache/spark/pull/43578#discussion_r1380960343

## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalogSuite.scala:

```diff
@@ -825,9 +825,8 @@ class V2SessionCatalogNamespaceSuite extends V2SessionCatalogBaseSuite {
       expected: scala.collection.Map[String, String],
       actual: scala.collection.Map[String, String]): Unit = {
     // remove location and comment that are automatically added by HMS unless they are expected
-    val toRemove =
-      CatalogV2Util.NAMESPACE_RESERVED_PROPERTIES.filter(expected.contains)
-    assert(expected -- toRemove === actual)
+    val toRemove = CatalogV2Util.NAMESPACE_RESERVED_PROPERTIES.toSet
```

Review Comment: Okay.
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
LuciferYang commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791748313

https://github.com/apache/spark/assets/1475305/56b53eae-d7fc-4b21-9424-317c5a3daee5

@ivoson Please update the pr description.
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
HyukjinKwon closed pull request #43635: [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL URL: https://github.com/apache/spark/pull/43635
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
HyukjinKwon commented on PR #43635: URL: https://github.com/apache/spark/pull/43635#issuecomment-1791743103

All related tests passed. Merged to master.
[PR] [SPARK-45742][CORE][FOLLOWUP] Remove unnecessary null check from `ArrayImplicits.SparkArrayOps#toImmutableArraySeq` [spark]
LuciferYang opened a new pull request, #43641: URL: https://github.com/apache/spark/pull/43641

### What changes were proposed in this pull request?
The implementation of the `immutable.ArraySeq.unsafeWrapArray` function is as follows:

```scala
def unsafeWrapArray[T](x: Array[T]): ArraySeq[T] = ((x: @unchecked) match {
  case null              => null
  case x: Array[AnyRef]  => new ofRef[AnyRef](x)
  case x: Array[Int]     => new ofInt(x)
  case x: Array[Double]  => new ofDouble(x)
  case x: Array[Long]    => new ofLong(x)
  case x: Array[Float]   => new ofFloat(x)
  case x: Array[Char]    => new ofChar(x)
  case x: Array[Byte]    => new ofByte(x)
  case x: Array[Short]   => new ofShort(x)
  case x: Array[Boolean] => new ofBoolean(x)
  case x: Array[Unit]    => new ofUnit(x)
}).asInstanceOf[ArraySeq[T]]
```

The first case of the match is `null`, so there is no need for another manual null check; this PR removes it.

### Why are the changes needed?
Remove an unnecessary null check from `ArrayImplicits.SparkArrayOps#toImmutableArraySeq`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test cases, such as `ArrayImplicitsSuite`.

### Was this patch authored or co-authored using generative AI tooling?
No
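For context, a hedged sketch of what the simplified extension could look like after this change (hypothetical shape; the actual Spark helper may differ):

```scala
import scala.collection.immutable

object ArrayImplicitsSketch {
  implicit class SparkArrayOps[T](private val xs: Array[T]) extends AnyVal {
    // unsafeWrapArray already maps null to null, so no extra guard is needed.
    def toImmutableArraySeq: immutable.ArraySeq[T] =
      immutable.ArraySeq.unsafeWrapArray(xs)
  }
}
```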
Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]
HyukjinKwon commented on code in PR #43630: URL: https://github.com/apache/spark/pull/43630#discussion_r1380920238

## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:

```diff
@@ -208,10 +209,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       throw QueryCompilationErrors.pathOptionNotSetCorrectlyWhenReadingError()
     }
 
-    DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).flatMap { provider =>
-      DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
-        source, paths: _*)
-    }.getOrElse(loadV1Source(paths: _*))
+    val isUserDefinedDataSource =
```

Review Comment: Let's at least separate the logic into a separate function if possible.
Re: [PR] [SPARK-45770][SQL][PYTHON][CONNECT] Fix column resolution in `DataFrame.drop` [spark]
zhengruifeng commented on code in PR #43632: URL: https://github.com/apache/spark/pull/43632#discussion_r1380919985

## sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -3064,13 +3064,31 @@ class Dataset[T] private[sql](
    */
   @scala.annotation.varargs
   def drop(col: Column, cols: Column*): DataFrame = {
-    val allColumns = col +: cols
-    val expressions = (for (col <- allColumns) yield col match {
+    val expressions = (col +: cols).map {
       case Column(u: UnresolvedAttribute) =>
-        queryExecution.analyzed.resolveQuoted(
```

Review Comment: I think a new plan, `DropColumns`, would be a better approach.
Re: [PR] [SPARK-45639][SQL][PYTHON] Support loading Python data sources in DataFrameReader [spark]
HyukjinKwon commented on code in PR #43630: URL: https://github.com/apache/spark/pull/43630#discussion_r1380919626

## sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:

```diff
@@ -208,10 +209,45 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
       throw QueryCompilationErrors.pathOptionNotSetCorrectlyWhenReadingError()
     }
 
-    DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).flatMap { provider =>
-      DataSourceV2Utils.loadV2Source(sparkSession, provider, userSpecifiedSchema, extraOptions,
-        source, paths: _*)
-    }.getOrElse(loadV1Source(paths: _*))
+    val isUserDefinedDataSource =
```

Review Comment: @cloud-fan @allisonwang-db we want to support this data source via the `USING` syntax, unlike DSv2, right? If that's the case, the logic for loading the DataSource has to live within `DataSource.lookupDataSource` and/or `DataSource.providingInstance`. I don't think we should mix the logic here with DSv2.
Re: [PR] [SPARK-44419][SQL] Support to extract partial filters of datasource v2 table and push them down [spark]
github-actions[bot] closed pull request #42000: [SPARK-44419][SQL] Support to extract partial filters of datasource v2 table and push them down URL: https://github.com/apache/spark/pull/42000
Re: [PR] [SPARK-44065][SQL] Optimize BroadcastHashJoin skew in OptimizeSkewedJoin [spark]
github-actions[bot] commented on PR #41609: URL: https://github.com/apache/spark/pull/41609#issuecomment-1791732437

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-44426][SQL] Optimize adaptive skew join for ExistenceJoin [spark]
github-actions[bot] closed pull request #42003: [SPARK-44426][SQL] Optimize adaptive skew join for ExistenceJoin URL: https://github.com/apache/spark/pull/42003
Re: [PR] [SPARK-44517][SQL] Respect ignorenulls and child's nullability in first [spark]
github-actions[bot] commented on PR #42117: URL: https://github.com/apache/spark/pull/42117#issuecomment-1791732388

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [WIP][SPARK-4836][UI] Show all stage attempts on UI's job details page [spark]
JoshRosen commented on PR #43640: URL: https://github.com/apache/spark/pull/43640#issuecomment-1791732011

This is an attempt at fixing a nearly nine-year-old Spark UI nit (https://issues.apache.org/jira/browse/SPARK-4836). I'm opening this draft PR early to get feedback on a couple of design questions:

1. Should this be configuration-flaggable?
2. How should we handle sorting of the stage table? It looks like stage sorting is done server-side and the table can only be sorted on a single column. By default the table is sorted on Stage ID.
3. Will the "initially-skipped-but-subsequently-retried" stages semantic be confusing to users?
   - See [the screenshot](https://user-images.githubusercontent.com/50748/280157322-90abb73a-c8f0-41fc-98ba-71de2300b349.png) for an example: in that case, I'm running a job that re-uses the shuffle output of a previous job and hits a fetch failure, causing a recomputation of that initially-skipped stage.
Re: [PR] [SPARK-45773][PYTHON][DOCS] Refine docstring of SparkSession.builder.config [spark]
HyukjinKwon commented on code in PR #43639: URL: https://github.com/apache/spark/pull/43639#discussion_r1380909244

## python/pyspark/sql/session.py:

```diff
@@ -253,20 +253,32 @@ def config(
         -------
         :class:`SparkSession.Builder`
 
+        See Also
+        --------
+        :class:`SparkConf`
+
         Examples
         --------
-        For an existing class:`SparkConf`, use `conf` parameter.
+        For an existing :class:`SparkConf`, use `conf` parameter.
 
         >>> from pyspark.conf import SparkConf
-        >>> SparkSession.builder.config(conf=SparkConf())
+        >>> conf = SparkConf().setAppName("example").setMaster("local")
+        >>> SparkSession.builder.config(conf=conf)
 
         >>> SparkSession.builder.config("spark.some.config.option", "some-value")
 
+        >>> SparkSession.builder \\
+        ...     .config("spark.some.config.number", 123) \\
+        ...     .config("spark.some.config.float", 0.123)
```

Review Comment: Using backslashes is sort of implicitly discouraged in PEP 8.
Re: [PR] [SPARK-45773][PYTHON][DOCS] Refine docstring of SparkSession.builder.config [spark]
HyukjinKwon commented on code in PR #43639: URL: https://github.com/apache/spark/pull/43639#discussion_r1380909115

## python/pyspark/sql/session.py:

```diff
@@ -253,20 +253,32 @@ def config(
         -------
         :class:`SparkSession.Builder`
 
+        See Also
+        --------
+        :class:`SparkConf`
+
         Examples
         --------
-        For an existing class:`SparkConf`, use `conf` parameter.
+        For an existing :class:`SparkConf`, use `conf` parameter.
 
         >>> from pyspark.conf import SparkConf
-        >>> SparkSession.builder.config(conf=SparkConf())
+        >>> conf = SparkConf().setAppName("example").setMaster("local")
+        >>> SparkSession.builder.config(conf=conf)
 
         >>> SparkSession.builder.config("spark.some.config.option", "some-value")
 
+        >>> SparkSession.builder \\
+        ...     .config("spark.some.config.number", 123) \\
+        ...     .config("spark.some.config.float", 0.123)
```

Review Comment:

```suggestion
        >>> SparkSession.builder.config(
        ...     "spark.some.config.number", 123).config(
        ...     "spark.some.config.float", 0.123)
```
[PR] [SPARK-45773][PYTHON][DOCS] Refine docstring of SparkSession.builder.config [spark]
allisonwang-db opened a new pull request, #43639: URL: https://github.com/apache/spark/pull/43639

### What changes were proposed in this pull request?
This PR refines the docstring of the method `SparkSession.builder.config`.

### Why are the changes needed?
To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doc test

### Was this patch authored or co-authored using generative AI tooling?
No
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
HyukjinKwon commented on PR #43635: URL: https://github.com/apache/spark/pull/43635#issuecomment-1791657207

Thanks all, addressed!
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
panbingkun commented on PR #43578: URL: https://github.com/apache/spark/pull/43578#issuecomment-1791642047

> let's rebase this one @panbingkun

Okay, thank you very much.
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
panbingkun commented on PR #37588: URL: https://github.com/apache/spark/pull/37588#issuecomment-1791622292

> I think it's pretty close now, thanks for your patience!

I will update it again today. Thank you very much for your patience and seriousness, which has been a great help to me!
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage `View.force` and `ScalaNumberProxy.signum` [spark]
dongjoon-hyun commented on PR #43637: URL: https://github.com/apache/spark/pull/43637#issuecomment-1791604791

Could you re-trigger the failed test pipeline, @ivoson ?
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage View.force and ScalaNumberProxy.signum [spark]
dongjoon-hyun commented on code in PR #43637: URL: https://github.com/apache/spark/pull/43637#discussion_r1380825014

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

```diff
@@ -362,10 +362,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
       case s: Seq[_] =>
         s.map(mapChild)
       case m: Map[_, _] =>
-        // `map.mapValues().view.force` return `Map` in Scala 2.12 but return `IndexedSeq` in Scala
-        // 2.13, call `toMap` method manually to compatible with Scala 2.12 and Scala 2.13
-        // `mapValues` is lazy and we need to force it to materialize
-        m.view.mapValues(mapChild).view.force.toMap
+        // `mapValues` is lazy and we need to force it to materialize by converting to Map
```

Review Comment: Is this, `we need to force it to materialize`, correct with the new code?
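For context, a small illustration of the laziness in question (plain Scala 2.13 behavior, not Spark code):

```scala
object MapValuesLaziness extends App {
  var calls = 0
  val m = Map(1 -> "a", 2 -> "b")

  // view.mapValues is lazy: the function has not run yet.
  val lazyView = m.view.mapValues { v => calls += 1; v.toUpperCase }
  assert(calls == 0)

  // toMap materializes the view, running the function once per entry.
  val materialized = lazyView.toMap
  assert(calls == 2)
  assert(materialized == Map(1 -> "A", 2 -> "B"))
}
```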
Re: [PR] Keep the taskBinary in Stage to avoid deserializing it multiple times, and send binary to executor instead of task. [spark]
mridulm commented on PR #43621: URL: https://github.com/apache/spark/pull/43621#issuecomment-1791582645 The broadcast variable is read once, not n times, and deserializing it n times prevents side effects between tasks; moving away from that would be a behavior change. On the performance aspect, as currently formulated, I would expect a negligible (if any) difference, though I would be happy to see numbers to the contrary to evaluate effectiveness! Additionally, given the possibility of side effects, the benefits would need to be very compelling to justify going down this path.
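To illustrate the side-effect concern: if a task closure carries mutable state, deserializing it once per task isolates that state, while sharing a single deserialized copy would let tasks observe each other's mutations. A hypothetical sketch using plain Java serialization (not Spark's actual task machinery):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

// Hypothetical task function carrying mutable state in its closure.
class TaskFunc extends Serializable {
  val seen = ArrayBuffer.empty[Int]
  def run(x: Int): Int = { seen += x; seen.size }
}

object TaskIsolationDemo extends App {
  def deserializeCopy[T <: AnyRef](obj: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(obj)
    out.flush()
    new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
      .readObject().asInstanceOf[T]
  }

  val taskBinary = new TaskFunc

  // Per-task deserialization (current behavior): each task gets a fresh copy.
  val task1 = deserializeCopy(taskBinary)
  val task2 = deserializeCopy(taskBinary)
  assert(task1.run(1) == 1 && task2.run(2) == 1) // no cross-task interference

  // One shared deserialized copy: the second run sees the first's mutation.
  val shared = deserializeCopy(taskBinary)
  assert(shared.run(1) == 1 && shared.run(2) == 2)
}
```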
Re: [PR] [SPARK-45757][ML] Avoid re-computation of NNZ in Binarizer [spark]
dongjoon-hyun commented on PR #43619: URL: https://github.com/apache/spark/pull/43619#issuecomment-1791561507 Merged to master for Apache Spark 4.0.0.
Re: [PR] [SPARK-45757][ML] Avoid re-computation of NNZ in Binarizer [spark]
dongjoon-hyun closed pull request #43619: [SPARK-45757][ML] Avoid re-computation of NNZ in Binarizer URL: https://github.com/apache/spark/pull/43619
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
dongjoon-hyun commented on code in PR #43635: URL: https://github.com/apache/spark/pull/43635#discussion_r1380796957

## python/pyspark/sql/tests/test_udf.py:

@@ -22,6 +22,8 @@
 import unittest
 import datetime

+from py4j.protocol import Py4JJavaError

Review Comment: This seems to be a leftover. Shall we remove?
```
./python/pyspark/sql/tests/test_udf.py:25:1: F401 'py4j.protocol.Py4JJavaError' imported but unused
from py4j.protocol import Py4JJavaError
^
1     F401 'py4j.protocol.Py4JJavaError' imported but unused
```
Re: [PR] [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 [spark]
dongjoon-hyun commented on PR #43581: URL: https://github.com/apache/spark/pull/43581#issuecomment-1791556755 Right. Finally, it's fixed and passed. Merged to master. Thank you, @itholic and @HyukjinKwon.
Re: [PR] [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 [spark]
dongjoon-hyun closed pull request #43581: [SPARK-45718][PS] Remove remaining deprecated Pandas features from Spark 3.4.0 URL: https://github.com/apache/spark/pull/43581
Re: [PR] [SPARK-41454][PYTHON] Support Python 3.11 [spark]
dongjoon-hyun commented on PR #38987: URL: https://github.com/apache/spark/pull/38987#issuecomment-1791554878 > I've created a task for reverting that change for versions below 3.4. I will notify you in this thread with further information. Thank you for letting us know, @mdhont. That's great news and will be helpful. BTW, just FYI, Apache Spark 3.3 will reach End-Of-Support next month (2023-12-15). The Apache Spark community is currently focused on Apache Spark 3.4.2, 3.5.1, and 4.0.0 (next year).
Re: [PR] [SPARK-45511][SS] State Data Source - Reader [spark]
rangadi commented on code in PR #43425: URL: https://github.com/apache/spark/pull/43425#discussion_r1380760045 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala: ## @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources.v2.state + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, JoinedRow, UnsafeRow} +import org.apache.spark.sql.connector.read.PartitionReader +import org.apache.spark.sql.execution.datasources.v2.state.utils.SchemaUtil +import org.apache.spark.sql.execution.streaming.state.{StateStore, StateStoreConf, StateStoreId, StateStoreProviderId} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableConfiguration + +class StatePartitionReader( Review Comment: There is a comment, but it does not give any details about what it does. We could have some information useful for future readers of this code (and current reviewers :)).
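To make the ask concrete, a hypothetical sketch of the kind of class-level documentation being requested (wording invented here, not taken from the PR):

```scala
/**
 * Hypothetical illustration only; not the PR's actual comment.
 *
 * Reads a single partition of a streaming operator's state store for the
 * state data source: given the checkpoint location, operator id, store name,
 * and partition id, load the matching state store, iterate its key-value
 * pairs, and emit each pair as one (key, value) row of the scan.
 */
trait StatePartitionReaderDocSketch
```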
Re: [PR] [SPARK-45511][SS] State Data Source - Reader [spark]
anishshri-db commented on code in PR #43425: URL: https://github.com/apache/spark/pull/43425#discussion_r1380729958 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala: ## @@ -0,0 +1,216 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources.v2.state + +import java.util +import java.util.UUID + +import scala.util.control.NonFatal + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path + +import org.apache.spark.sql.{RuntimeConfig, SparkSession} +import org.apache.spark.sql.connector.catalog.{Table, TableProvider} +import org.apache.spark.sql.connector.expressions.Transform +import org.apache.spark.sql.execution.datasources.v2.state.StateDataSource.JoinSideValues.JoinSideValues +import org.apache.spark.sql.execution.streaming.{CommitLog, OffsetSeqLog, OffsetSeqMetadata} +import org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinHelper.{LeftSide, RightSide} +import org.apache.spark.sql.execution.streaming.state.{StateSchemaCompatibilityChecker, StateStore, StateStoreConf, StateStoreId, StateStoreProviderId} +import org.apache.spark.sql.sources.DataSourceRegister +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.util.CaseInsensitiveStringMap + +/** + * An implementation of [[TableProvider]] with [[DataSourceRegister]] for State Store data source. + */ +class StateDataSource extends TableProvider with DataSourceRegister { + import StateDataSource._ + + private lazy val session: SparkSession = SparkSession.active + + private lazy val hadoopConf: Configuration = session.sessionState.newHadoopConf() + + override def shortName(): String = "statestore" + + override def getTable( + schema: StructType, + partitioning: Array[Transform], + properties: util.Map[String, String]): Table = { +val sourceOptions = StateSourceOptions.apply(session, hadoopConf, properties) +val stateConf = buildStateStoreConf(sourceOptions.resolvedCpLocation, sourceOptions.batchId) +new StateTable(session, schema, sourceOptions, stateConf) + } + + override def inferSchema(options: CaseInsensitiveStringMap): StructType = { +val partitionId = StateStore.PARTITION_ID_TO_CHECK_SCHEMA +val sourceOptions = StateSourceOptions.apply(session, hadoopConf, options) +if (sourceOptions.joinSide != JoinSideValues.none && +sourceOptions.storeName != StateStoreId.DEFAULT_STORE_NAME) { + throw new IllegalArgumentException(s"The options '$PARAM_JOIN_SIDE' and " + +s"'$PARAM_STORE_NAME' cannot be specified together. 
Please specify either one.") +} + +val stateCheckpointLocation = sourceOptions.stateCheckpointLocation + +try { + val (keySchema, valueSchema) = sourceOptions.joinSide match { +case JoinSideValues.left => + StreamStreamJoinStateHelper.readKeyValueSchema(session, stateCheckpointLocation.toString, +sourceOptions.operatorId, LeftSide) + +case JoinSideValues.right => + StreamStreamJoinStateHelper.readKeyValueSchema(session, stateCheckpointLocation.toString, +sourceOptions.operatorId, RightSide) + +case JoinSideValues.none => + val storeId = new StateStoreId(stateCheckpointLocation.toString, sourceOptions.operatorId, +partitionId, sourceOptions.storeName) + val providerId = new StateStoreProviderId(storeId, UUID.randomUUID()) + val manager = new StateSchemaCompatibilityChecker(providerId, hadoopConf) + manager.readSchemaFile() + } + + new StructType() +.add("key", keySchema) +.add("value", valueSchema) +} catch { + case NonFatal(e) => +throw new IllegalArgumentException("Fail to read the state schema. Either the file " + Review Comment: Nit: `Failed to read the` ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSourceReadSuite.scala: ## @@ -0,0 +1,779 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + *
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun closed pull request #43638: [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default URL: https://github.com/apache/spark/pull/43638
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791474994 Oh, got it, the existing one looks good. I couldn't see it from the diff, so I thought `SingleEventLogFileWriter` wasn't tested.
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791475462 Thank you for your confirmation! Merged to master for Apache Spark 4.0.0.
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791468911 Just for your confirmation. I keep the existing test structure.
```
buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
conf.set(EVENT_LOG_ENABLE_ROLLING, true)
buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
conf.set(EVENT_LOG_ENABLE_ROLLING, false)
buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380713182

## core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:

@@ -163,6 +163,7 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit
   test("SPARK-31764: isBarrier should be logged in event log") {
     val conf = new SparkConf()
     conf.set(EVENT_LOG_ENABLED, true)
+    conf.set(EVENT_LOG_ENABLE_ROLLING, false)

Review Comment: Yes, this test case tries to read the event log file.
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380704945

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)

Review Comment: Or do we want to:
```suggestion
    conf.set(EVENT_LOG_ENABLE_ROLLING, false)
    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380705437

## core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:

@@ -163,6 +163,7 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit
   test("SPARK-31764: isBarrier should be logged in event log") {
     val conf = new SparkConf()
     conf.set(EVENT_LOG_ENABLED, true)
+    conf.set(EVENT_LOG_ENABLE_ROLLING, false)

Review Comment: Does it fail without setting this to false?
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380704238

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)
     buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])

Review Comment:
```suggestion
```

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)

Review Comment: Is it redundant then?
```suggestion
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
viirya commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380703869

## core/src/test/scala/org/apache/spark/deploy/history/EventLogFileWritersSuite.scala:

@@ -66,7 +66,7 @@ abstract class EventLogFileWritersSuite extends SparkFunSuite with LocalSparkCon
     conf.set(EVENT_LOG_DIR, testDir.toString)

     // default config
-    buildWriterAndVerify(conf, classOf[SingleEventLogFileWriter])
+    buildWriterAndVerify(conf, classOf[RollingEventLogFilesWriter])
     conf.set(EVENT_LOG_ENABLE_ROLLING, true)

Review Comment: Is it redundant then?
```suggestion
    conf.set(EVENT_LOG_ENABLE_ROLLING, true)
```
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on PR #43638: URL: https://github.com/apache/spark/pull/43638#issuecomment-1791433846 The AppVeyor failure (SparkR) is unrelated to this PR. Could you review this PR when you have some time, @viirya?
Re: [PR] [SPARK-45733][CONNECT][PYTHON] Support multiple retry policies [spark]
juliuszsompolski commented on code in PR #43591: URL: https://github.com/apache/spark/pull/43591#discussion_r1380491812 ## python/pyspark/sql/connect/client/core.py: ## @@ -550,13 +545,11 @@ def fromProto(cls, pb: pb2.ConfigResponse) -> "ConfigResult": ) -class SparkConnectClient(object): -""" -Conceptually the remote spark session that communicates with the server -""" +class DefaultPolicy(RetryPolicy): Review Comment: nit: DefaultRetryPolicy ## python/pyspark/sql/connect/client/core.py: ## @@ -688,6 +695,14 @@ def enable_reattachable_execute(self) -> "SparkConnectClient": self._use_reattachable_execute = True return self +def register_retry_policy(self, policy: RetryPolicy): +if policy.name in self._known_retry_policies: +raise ValueError("Already known policy") +self._known_retry_policies[policy.name] = policy + +def set_retry_policies(self, policies: List[str]): +self._retry_policies = [self._known_retry_policies[name] for name in policies] Review Comment: do we need the functionality of registering policies that we are not going to be using? is there a use case for adding and removing policies? unless there's a good reason to have a separate `register` and `set`, maybe simplify it and just set is enough? ## python/pyspark/sql/connect/client/retries.py: ## @@ -0,0 +1,228 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import random +import time +import typing +from typing import Optional, Callable, Generator, List, Type +from types import TracebackType +from pyspark.sql.connect.client.logging import logger + +""" +This module contains retry system. The system is designed to be +significantly customizable. + +A key aspect of retries is RetryPolicy class, describing a single policy. +There can be more than one policy defined at the same time. Each policy +determines which error types it can retry and how exactly. + +For instance, networking errors should likely be retried differently that +remote resource being available. + +Given a sequence of policies, retry logic applies all of them in sequential +order, keeping track of different policies budgets. +""" + + +class RetryPolicy: +""" +Describes key aspects of RetryPolicy. + +It's advised that different policies are implemented as different subclasses. 
+""" + +def __init__( +self, +max_retries: Optional[int] = None, +initial_backoff: int = 1000, +max_backoff: Optional[int] = None, +backoff_multiplier: float = 1.0, +jitter: int = 0, +min_jitter_threshold: int = 0, +): +self.max_retries = max_retries +self.initial_backoff = initial_backoff +self.max_backoff = max_backoff +self.backoff_multiplier = backoff_multiplier +self.jitter = jitter +self.min_jitter_threshold = min_jitter_threshold + +@property +def name(self): +return self.__class__.__name__ + +def can_retry(self, exception: BaseException): +return False + +def to_state(self) -> "RetryPolicyState": +return RetryPolicyState(self) + + +class RetryPolicyState: +""" +This class represents stateful part of the specific policy. +""" + +def __init__(self, policy: RetryPolicy): +self._policy = policy + +# Will allow attempts [0, self._policy.max_retries) +self._attempt = 0 +self._next_wait: float = self._policy.initial_backoff + +@property +def policy(self): +return self._policy + +@property +def name(self): +return self.policy.name + +def can_retry(self, exception: BaseException): +return self.policy.can_retry(exception) + +def next_attempt(self) -> Optional[int]: +""" +Returns +--- +Randomized time (in milliseconds) to wait until this attempt +or None if this policy doesn't allow more retries. +""" + +if self.policy.max_retries is not None and self._attempt >= self.policy.max_retries: +# No more retries under this policy +return None + +
Re: [PR] [SPARK-45768][SQL][PYTHON] Make faulthandler a runtime configuration for Python execution in SQL [spark]
ueshin commented on code in PR #43635: URL: https://github.com/apache/spark/pull/43635#discussion_r1380579484

## python/pyspark/sql/tests/test_udf.py:

@@ -1020,6 +1022,15 @@ def test_udf(a):
         with self.assertRaisesRegex(PythonException, "StopIteration"):
             self.spark.range(10).select(test_udf(col("id"))).show()

+    def test_python_udf_segfault(self):
+        with self.sql_conf({"spark.sql.execution.pyspark.udf.faulthandler.enabled": True}):
+            try:
+                import ctypes
+
+                self.spark.range(1).select(udf(lambda x: ctypes.string_at(0))("id")).collect()
+            except Exception as e:
+                self.assertRegex(str(e), "Segmentation fault")

Review Comment: nit: wondering if we can use `with self.assertRaisesRegex( ... )` here?
Re: [PR] [SPARK-45771][CORE] Enable `spark.eventLog.rolling.enabled` by default [spark]
dongjoon-hyun commented on code in PR #43638: URL: https://github.com/apache/spark/pull/43638#discussion_r1380456215

## core/src/test/scala/org/apache/spark/deploy/history/EventLogTestHelper.scala:

@@ -38,6 +38,7 @@ object EventLogTestHelper {
   def getLoggingConf(logDir: Path, compressionCodec: Option[String] = None): SparkConf = {
     val conf = new SparkConf
     conf.set(EVENT_LOG_ENABLED, true)
+    conf.set(EVENT_LOG_ENABLE_ROLLING, false)

Review Comment: This is consistent with the existing function description. https://github.com/apache/spark/blob/5970d353360d4fb6647c8fbc10f733abf009eca1/core/src/test/scala/org/apache/spark/deploy/history/EventLogTestHelper.scala#L35
Re: [PR] [SPARK-45688][SPARK-45693][CORE] Clean up the deprecated API usage related to `MapOps` & Fix `method += in trait Growable is deprecated` [spark]
LuciferYang commented on PR #43578: URL: https://github.com/apache/spark/pull/43578#issuecomment-1791089211 let's rebase this one @panbingkun
Re: [PR] [SPARK-45694][SPARK-45695][SQL] Clean up deprecated API usage View.force and ScalaNumberProxy.signum [spark]
ivoson commented on code in PR #43637: URL: https://github.com/apache/spark/pull/43637#discussion_r1380412775

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

@@ -362,10 +362,8 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
     case s: Seq[_] => s.map(mapChild)
     case m: Map[_, _] =>
-      // `map.mapValues().view.force` return `Map` in Scala 2.12 but return `IndexedSeq` in Scala
-      // 2.13, call `toMap` method manually to compatible with Scala 2.12 and Scala 2.13
-      // `mapValues` is lazy and we need to force it to materialize
-      m.view.mapValues(mapChild).view.force.toMap
+      // `mapValues` is lazy and we need to force it to materialize by invoking `toIndexedSeq`
+      m.view.mapValues(mapChild).view.toIndexedSeq.toMap

Review Comment: Yeah, it looks better. Thanks, done.

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala:

@@ -784,13 +782,12 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]]
       arg.asInstanceOf[BaseType].clone()
     case Some(arg: TreeNode[_]) if containsChild(arg) =>
       Some(arg.asInstanceOf[BaseType].clone())
-    // `map.mapValues().view.force` return `Map` in Scala 2.12 but return `IndexedSeq` in Scala
-    // 2.13, call `toMap` method manually to compatible with Scala 2.12 and Scala 2.13
+    // `mapValues` is lazy and we need to force it to materialize by invoking `toIndexedSeq`
     case m: Map[_, _] => m.view.mapValues {
       case arg: TreeNode[_] if containsChild(arg) => arg.asInstanceOf[BaseType].clone()
       case other => other
-    }.view.force.toMap // `mapValues` is lazy and we need to force it to materialize
+    }.view.toIndexedSeq.toMap

Review Comment: done
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
pan3793 commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380365514

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:

@@ -21,19 +21,15 @@ import java.util.Locale

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._

 import org.apache.spark.util.Utils

 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
```suggestion
    val className =
      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
    codec.lowerCaseName() -> Option(codec.getCompressionCodec).map(_.getClass.getName).orNull
```
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
pan3793 commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380368351

## sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala:

@@ -2766,7 +2767,8 @@ class DataFrameSuite extends QueryTest
       // The data set has 2 partitions, so Spark will write at least 2 json files.
       // Use a non-splittable compression (gzip), to make sure the json scan RDD has at least 2
       // partitions.
-      .write.partitionBy("p").option("compression", "gzip").json(path.getCanonicalPath)
+      .write.partitionBy("p")
+      .option("compression", GZIP.lowerCaseName()).json(path.getCanonicalPath)

Review Comment: It's recommended to omit the `()` when calling a Java no-arg method that has no side effects.
```suggestion
      .option("compression", GZIP.lowerCaseName).json(path.getCanonicalPath)
```
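A tiny sketch of the convention being cited:

```scala
// Scala style: omit `()` for zero-arg calls with no side effects,
// keep `()` for calls that mutate state.
object ParensConventionDemo extends App {
  val name = "spark"
  val len = name.length          // pure accessor (a Java no-arg method): no parens

  val sb = new StringBuilder("x")
  sb.clear()                     // mutating call: keep the parens
  assert(len == 5 && sb.isEmpty)
}
```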
Re: [PR] [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order [spark]
abellina commented on PR #43627: URL: https://github.com/apache/spark/pull/43627#issuecomment-1791035949 @tgravescs thanks for the review. I have handled your comments in this commit: https://github.com/apache/spark/pull/43627/commits/0bd7e990812d23166509ad6585c8d352f78e569f
[PR] [SPARK-45771][CORE] Enable spark.eventLog.rolling.enabled by default [spark]
dongjoon-hyun opened a new pull request, #43638: URL: https://github.com/apache/spark/pull/43638

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]
LuciferYang commented on PR #43633: URL: https://github.com/apache/spark/pull/43633#issuecomment-1790910694 Merged into master for Spark 4.0. Thanks @panbingkun @dongjoon-hyun and @beliefer
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
pan3793 commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380365514

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala:

@@ -21,19 +21,15 @@ import java.util.Locale

 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.io.SequenceFile.CompressionType
-import org.apache.hadoop.io.compress._

 import org.apache.spark.util.Utils

 object CompressionCodecs {
-  private val shortCompressionCodecNames = Map(
-    "none" -> null,
-    "uncompressed" -> null,
-    "bzip2" -> classOf[BZip2Codec].getName,
-    "deflate" -> classOf[DeflateCodec].getName,
-    "gzip" -> classOf[GzipCodec].getName,
-    "lz4" -> classOf[Lz4Codec].getName,
-    "snappy" -> classOf[SnappyCodec].getName)
+  private val shortCompressionCodecNames = HadoopCompressionCodec.values().map { codec =>
+    val className =
+      if (codec.getCompressionCodec == null) null else codec.getCompressionCodec.getClass.getName
+    codec.lowerCaseName() -> className

Review Comment:
```suggestion
    codec.lowerCaseName() -> Option(codec.getCompressionCodec).map(_.getClass.getName).orNull
```
Re: [PR] [SPARK-45730][CORE] Make ReloadingX509TrustManagerSuite less flaky [spark]
hasnain-db commented on code in PR #43596: URL: https://github.com/apache/spark/pull/43596#discussion_r1380349554

## common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:

@@ -161,14 +161,17 @@ public void testReload() throws Exception {
   // At this point we haven't reloaded, just the initial load
   assertEquals(0, tm.reloadCount);

+  // Wait so that the file modification time is different
+  Thread.sleep((tm.getReloadInterval() + 200));

Review Comment: Thanks, fixing. Also, for future readers: the reason this is needed is that in certain scenarios this test can run too fast, so the modification time of the original trust store and the one we create right after this is the same, in which case we do not reload the file, and the test fails.

## common/network-common/src/test/java/org/apache/spark/network/ssl/ReloadingX509TrustManagerSuite.java:

@@ -161,14 +161,17 @@ public void testReload() throws Exception {
   // At this point we haven't reloaded, just the initial load
   assertEquals(0, tm.reloadCount);

+  // Wait so that the file modification time is different
+  Thread.sleep((tm.getReloadInterval() + 200));
+
   // Add another cert
   Map certs = new HashMap();
   certs.put("cert1", cert1);
   certs.put("cert2", cert2);
   createTrustStore(trustStore, "password", certs);

-  // Wait up to 5s until we reload
-  waitForReloadCount(tm, 1, 50);
+  // Wait up to 10s until we reload
+  waitForReloadCount(tm, 1, 100);

Review Comment: Yeah, on a heavily loaded system (~40 load average on ~40 cores) this can take longer than 5s.
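A minimal sketch of the underlying race, assuming (as described above) that the reloader only re-reads the store when the file's modification time changes:

```scala
import java.nio.file.{Files, Path}

object MtimeGranularityDemo extends App {
  val f: Path = Files.createTempFile("truststore", ".jks")
  Files.write(f, Array[Byte](1))
  val t1 = Files.getLastModifiedTime(f)

  Files.write(f, Array[Byte](1, 2)) // rewritten almost immediately afterwards
  val t2 = Files.getLastModifiedTime(f)

  // On filesystems with coarse mtime granularity, t1 == t2 is possible here,
  // and an mtime-keyed reloader would silently keep the stale contents.
  println(s"first=$t1 second=$t2 equal=${t1 == t2}")
  Files.deleteIfExists(f)
}
```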
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380287495 ## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ## @@ -126,4 +158,258 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils { } } } + + test("show table in a not existing namespace") { +checkError( + exception = intercept[AnalysisException] { +sql(s"SHOW TABLES IN $catalog.nonexist") + }, + errorClass = "SCHEMA_NOT_FOUND", + parameters = Map("schemaName" -> "`nonexist`")) + } + + test("show table extended in a not existing namespace") { +checkError( + exception = intercept[AnalysisException] { +sql(s"SHOW TABLE EXTENDED IN $catalog.nonexist LIKE '*tbl*'") + }, + errorClass = "SCHEMA_NOT_FOUND", + parameters = Map("schemaName" -> "`nonexist`")) + } + + test("show table extended in a not existing table") { +val namespace = "ns1" +val table = "nonexist" +withNamespaceAndTable(namespace, table, catalog) { _ => + val result = sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '*$table*'") + assert(result.schema.fieldNames === +Seq("namespace", "tableName", "isTemporary", "information")) + assert(result.collect().isEmpty) +} + } + + test("show table extended in a not existing partition") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing PARTITIONED BY (id)") + sql(s"ALTER TABLE $tbl ADD PARTITION (id = 1)") + checkError( +exception = intercept[AnalysisException] { + sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 2)") +}, +errorClass = "PARTITIONS_NOT_FOUND", +parameters = Map( + "partitionList" -> "PARTITION (`id` = 2)", + "tableName" -> "`ns1`.`tbl`" +) + ) +} + } + + test("show table extended in non-partitioned table") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing") + val e = intercept[AnalysisException] { +sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 1)") + } + val (errorClass, parameters) = extendedPartInNonPartedTableError(catalog, namespace, table) + checkError(exception = e, errorClass = errorClass, parameters = parameters) +} + } + + test("show table extended in multi partition key - " + +"the command's partition parameters are complete") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " + +s"$defaultUsing PARTITIONED BY (id1, id2)") + sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)") + + val result = sql(s"SHOW TABLE EXTENDED FROM $catalog.$namespace " + +s"LIKE '$table' PARTITION(id1 = 1, id2 = 2)") + assert(result.schema.fieldNames === +Seq("namespace", "tableName", "isTemporary", "information")) + val resultCollect = result.collect() + assert(resultCollect(0).length == 4) + assert(resultCollect(0)(0) === namespace) + assert(resultCollect(0)(1) === table) + assert(resultCollect(0)(2) === false) + val actualResult = replace(resultCollect(0)(3).toString) + assert(actualResult === extendedPartExpectedResult) +} + } + + test("show table extended in multi partition key - " + +"the command's partition parameters are incomplete") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { tbl => + sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) 
" + +s"$defaultUsing PARTITIONED BY (id1, id2)") + sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)") + + checkError( +exception = intercept[AnalysisException] { + sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace " + +s"LIKE '$table' PARTITION(id1 = 1)") +}, +errorClass = "_LEGACY_ERROR_TEMP_1232", +parameters = Map( + "specKeys" -> "id1", + "partitionColumnNames" -> "id1, id2", + "tableName" -> s"`$catalog`.`$namespace`.`$table`") + ) +} + } + + test("show table extended in multi tables") { +val namespace = "ns1" +val table = "tbl" +withNamespaceAndTable(namespace, table, catalog) { _ => + sql(s"CREATE TABLE $catalog.$namespace.$table (id bigint, data string) " + +s"$defaultUsing PARTITIONED BY (id)") + val table1 = "tbl1" + val table2 = "tbl2" + withTable(table1, table2) { +sql(s"CREATE TABLE $catalog.$namespace.$table1 (id1 bigint,
Re: [PR] [SPARK-45614][SQL] Assign names to error _LEGACY_ERROR_TEMP_215[6,7,8] [spark]
cloud-fan commented on code in PR #43481: URL: https://github.com/apache/spark/pull/43481#discussion_r1380320255

## sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala:

@@ -36,6 +36,7 @@
 import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
 import org.apache.spark.sql.catalyst.expressions.objects.InitializeJavaBean
 import org.apache.spark.sql.catalyst.rules.RuleIdCollection
 import org.apache.spark.sql.catalyst.util.BadRecordException
+import org.apache.spark.sql.errors.DataTypeErrorsBase

Review Comment: nit: We are already in the package `org.apache.spark.sql.errors` and this import is useless.
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380274949

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala:

@@ -40,6 +40,38 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils {
     checkAnswer(df, expected)
   }

+  // the error class & error parameters of
+  // `SHOW TABLE EXTENDED ... PARTITION ... in non-partitioned table`
+  protected def extendedPartInNonPartedTableError(
+      catalog: String,
+      namespace: String,
+      table: String): (String, Map[String, String])
+
+  protected def extendedPartExpectedResult: String =
+    "Partition Values: [id1=1, id2=2]"
+
+  protected def namespaceKey: String = "Database"
+
+  protected def extendedTableExpectedResultDiff: String
+
+  private def extendedTableExpectedResult(
+      catalog: String,
+      namespaceName: String,

Review Comment: does this need to be a parameter? isn't it always the value of `def namespaceKey`?
Re: [PR] [SPARK-45758][SQL] Introduce a mapper for hadoop compression codecs [spark]
LuciferYang commented on code in PR #43620: URL: https://github.com/apache/spark/pull/43620#discussion_r1380243958 ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.util; + +import java.util.Arrays; +import java.util.Locale; +import java.util.Map; +import java.util.stream.Collectors; + +import org.apache.hadoop.io.compress.BZip2Codec; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.io.compress.DeflateCodec; +import org.apache.hadoop.io.compress.GzipCodec; +import org.apache.hadoop.io.compress.Lz4Codec; +import org.apache.hadoop.io.compress.SnappyCodec; + +/** + * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs. + */ +public enum HadoopCompressionCodec { + NONE(null), + UNCOMPRESSED(null), + BZIP2(new BZip2Codec()), + DEFLATE(new DeflateCodec()), + GZIP(new GzipCodec()), + LZ4(new Lz4Codec()), + SNAPPY(new SnappyCodec()); + + // TODO supports ZStandardCodec + + private final CompressionCodec compressionCodec; + + HadoopCompressionCodec(CompressionCodec compressionCodec) { +this.compressionCodec = compressionCodec; + } + + public CompressionCodec getCompressionCodec() { +return this.compressionCodec; + } + + private static final Map codecNameMap = +Arrays.stream(HadoopCompressionCodec.values()).collect( + Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT))); Review Comment: ```suggestion Collectors.toMap(Enum::name, codec -> codec.name().toLowerCase(Locale.ROOT))); ``` ## sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/HadoopCompressionCodec.java: ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.catalyst.util; + +import java.util.Arrays; +import java.util.Locale; +import java.util.Map; +import java.util.stream.Collectors; + +import org.apache.hadoop.io.compress.BZip2Codec; +import org.apache.hadoop.io.compress.CompressionCodec; +import org.apache.hadoop.io.compress.DeflateCodec; +import org.apache.hadoop.io.compress.GzipCodec; +import org.apache.hadoop.io.compress.Lz4Codec; +import org.apache.hadoop.io.compress.SnappyCodec; + +/** + * A mapper class from Spark supported hadoop compression codecs to hadoop compression codecs. + */ +public enum HadoopCompressionCodec { + NONE(null), + UNCOMPRESSED(null), + BZIP2(new BZip2Codec()), + DEFLATE(new DeflateCodec()), + GZIP(new GzipCodec()), + LZ4(new Lz4Codec()), + SNAPPY(new SnappyCodec()); + + // TODO supports ZStandardCodec + + private final CompressionCodec compressionCodec; + + HadoopCompressionCodec(CompressionCodec compressionCodec) { +this.compressionCodec = compressionCodec; + } + + public CompressionCodec getCompressionCodec() { +return this.compressionCodec; + } + + private static final Map codecNameMap = +Arrays.stream(HadoopCompressionCodec.values()).collect( + Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT))); Review Comment: nit: we can use method reference.
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380290514

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##
@@ -126,4 +158,258 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils {
       }
     }
   }
+
+  test("show table in a not existing namespace") {
+    checkError(
+      exception = intercept[AnalysisException] {
+        sql(s"SHOW TABLES IN $catalog.nonexist")
+      },
+      errorClass = "SCHEMA_NOT_FOUND",
+      parameters = Map("schemaName" -> "`nonexist`"))
+  }
+
+  test("show table extended in a not existing namespace") {
+    checkError(
+      exception = intercept[AnalysisException] {
+        sql(s"SHOW TABLE EXTENDED IN $catalog.nonexist LIKE '*tbl*'")
+      },
+      errorClass = "SCHEMA_NOT_FOUND",
+      parameters = Map("schemaName" -> "`nonexist`"))
+  }
+
+  test("show table extended in a not existing table") {
+    val namespace = "ns1"
+    val table = "nonexist"
+    withNamespaceAndTable(namespace, table, catalog) { _ =>
+      val result = sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '*$table*'")
+      assert(result.schema.fieldNames ===
+        Seq("namespace", "tableName", "isTemporary", "information"))
+      assert(result.collect().isEmpty)
+    }
+  }
+
+  test("show table extended in a not existing partition") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing PARTITIONED BY (id)")
+      sql(s"ALTER TABLE $tbl ADD PARTITION (id = 1)")
+      checkError(
+        exception = intercept[AnalysisException] {
+          sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 2)")
+        },
+        errorClass = "PARTITIONS_NOT_FOUND",
+        parameters = Map(
+          "partitionList" -> "PARTITION (`id` = 2)",
+          "tableName" -> "`ns1`.`tbl`"
+        )
+      )
+    }
+  }
+
+  test("show table extended in non-partitioned table") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id bigint, data string) $defaultUsing")
+      val e = intercept[AnalysisException] {
+        sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace LIKE '$table' PARTITION(id = 1)")
+      }
+      val (errorClass, parameters) = extendedPartInNonPartedTableError(catalog, namespace, table)
+      checkError(exception = e, errorClass = errorClass, parameters = parameters)
+    }
+  }
+
+  test("show table extended in multi partition key - " +
+    "the command's partition parameters are complete") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id1, id2)")
+      sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)")
+
+      val result = sql(s"SHOW TABLE EXTENDED FROM $catalog.$namespace " +
+        s"LIKE '$table' PARTITION(id1 = 1, id2 = 2)")
+      assert(result.schema.fieldNames ===
+        Seq("namespace", "tableName", "isTemporary", "information"))
+      val resultCollect = result.collect()
+      assert(resultCollect(0).length == 4)
+      assert(resultCollect(0)(0) === namespace)
+      assert(resultCollect(0)(1) === table)
+      assert(resultCollect(0)(2) === false)
+      val actualResult = replace(resultCollect(0)(3).toString)
+      assert(actualResult === extendedPartExpectedResult)
+    }
+  }
+
+  test("show table extended in multi partition key - " +
+    "the command's partition parameters are incomplete") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { tbl =>
+      sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id1, id2)")
+      sql(s"ALTER TABLE $tbl ADD PARTITION (id1 = 1, id2 = 2)")
+
+      checkError(
+        exception = intercept[AnalysisException] {
+          sql(s"SHOW TABLE EXTENDED IN $catalog.$namespace " +
+            s"LIKE '$table' PARTITION(id1 = 1)")
+        },
+        errorClass = "_LEGACY_ERROR_TEMP_1232",
+        parameters = Map(
+          "specKeys" -> "id1",
+          "partitionColumnNames" -> "id1, id2",
+          "tableName" -> s"`$catalog`.`$namespace`.`$table`")
+      )
+    }
+  }
+
+  test("show table extended in multi tables") {
+    val namespace = "ns1"
+    val table = "tbl"
+    withNamespaceAndTable(namespace, table, catalog) { _ =>
+      sql(s"CREATE TABLE $catalog.$namespace.$table (id bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id)")
+      val table1 = "tbl1"
+      val table2 = "tbl2"
+      withTable(table1, table2) {
+        sql(s"CREATE TABLE $catalog.$namespace.$table1 (id1 bigint,
Re: [PR] [SPARK-45760][SQL] Add With expression to avoid duplicating expressions [spark]
wangyum commented on code in PR #43623: URL: https://github.com/apache/spark/pull/43623#discussion_r1380292748

## connector/connect/server/src/test/scala/org/apache/spark/sql/connect/ProtoToParsedPlanTestSuite.scala: ##
@@ -181,8 +183,15 @@ class ProtoToParsedPlanTestSuite
     val planner = new SparkConnectPlanner(SessionHolder.forTesting(spark))
     val catalystPlan =
       analyzer.executeAndCheck(planner.transformRelation(relation), new QueryPlanningTracker)
-    val actual =
-      removeMemoryAddress(normalizeExprIds(ReplaceExpressions(catalystPlan)).treeString)
+    val finalAnalyzedPlan = {
+      object Helper extends RuleExecutor[LogicalPlan] {
+        val batches =
+          Batch("Finish Analysis", Once, ReplaceExpressions) ::
+          Batch("Rewrite With expression", FixedPoint(10), RewriteWithExpression) :: Nil

Review Comment:
   Please fix the Scala style.
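For reference, a minimal sketch of one layout that should satisfy the style checker, assuming the imports already present in ProtoToParsedPlanTestSuite; only the indentation differs from the quoted diff:

```scala
// Sketch only: same rules as the quoted diff, with continuation indentation
// so the :: operands line up under the first Batch.
val finalAnalyzedPlan = {
  object Helper extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("Finish Analysis", Once, ReplaceExpressions) ::
      Batch("Rewrite With expression", FixedPoint(10), RewriteWithExpression) :: Nil
  }
  // RuleExecutor.execute runs the batches over the analyzed plan.
  Helper.execute(catalystPlan)
}
```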
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380291630

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/ShowTablesSuite.scala: ##
@@ -165,4 +154,63 @@ class ShowTablesSuite extends ShowTablesSuiteBase with CommandSuiteBase {
     }
   }
 }
+
+  override protected def extendedPartInNonPartedTableError(
+      catalog: String,
+      namespace: String,
+      table: String): (String, Map[String, String]) = {
+    ("_LEGACY_ERROR_TEMP_1251",

Review Comment:
   How hard would it be to unify this error between v1 and v2 tables?
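One possible shape of such a unification, purely as a sketch: a single helper in QueryCompilationErrors (which mixes in QueryErrorsBase, providing toSQLId) that both the v1 and v2 command paths would call. The error-class name below is hypothetical, not an existing Spark error class:

```scala
// Hypothetical sketch: if both code paths threw this, the per-catalog
// extendedPartInNonPartedTableError override above would become unnecessary.
def showTableExtendedOnNonPartitionedTableError(tableName: Seq[String]): Throwable = {
  new AnalysisException(
    errorClass = "SHOW_TABLE_EXTENDED_ON_NON_PARTITIONED_TABLE", // hypothetical name
    messageParameters = Map("tableName" -> toSQLId(tableName)))
}
```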
Re: [PR] [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT [spark]
LuciferYang closed pull request #43633: [SPARK-45767][CORE] Delete `TimeStampedHashMap` and its UT URL: https://github.com/apache/spark/pull/43633
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380268144

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala: ##
@@ -0,0 +1,194 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.v2
+
+import scala.collection.mutable
+import scala.collection.mutable.ArrayBuffer
+import scala.jdk.CollectionConverters._
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.ResolvedPartitionSpec
+import org.apache.spark.sql.catalyst.catalog.CatalogTableType
+import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal}
+import org.apache.spark.sql.catalyst.util.{quoteIdentifier, StringUtils}
+import org.apache.spark.sql.connector.catalog.{CatalogV2Util, Identifier, SupportsPartitionManagement, Table, TableCatalog}
+import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
+import org.apache.spark.sql.errors.QueryCompilationErrors
+import org.apache.spark.sql.execution.LeafExecNode
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits.TableHelper
+import org.apache.spark.sql.types.{StringType, StructType}
+
+/**
+ * Physical plan node for showing tables without partition, Show the information of tables.
+ */
+case class ShowTablesExtendedExec(
+    output: Seq[Attribute],
+    catalog: TableCatalog,
+    namespace: Seq[String],
+    pattern: String) extends V2CommandExec with LeafExecNode {
+  override protected def run(): Seq[InternalRow] = {
+    val rows = new ArrayBuffer[InternalRow]()
+
+    // fetch tables
+    // TODO We need a new listTable overload that takes a pattern string.
+    val tables = catalog.listTables(namespace.toArray)
+    tables.map { tableIdent =>
+      if (StringUtils.filterPattern(Seq(tableIdent.name()), pattern).nonEmpty) {
+        val table = catalog.loadTable(tableIdent)
+        val information = getTableDetails(catalog.name, tableIdent, table)
+        rows += toCatalystRow(tableIdent.namespace().quoted, tableIdent.name(), false,
+          s"$information\n")
+      }
+    }
+
+    // fetch temp views, includes: global temp view, local temp view
+    val sessionCatalog = session.sessionState.catalog
+    val db = namespace match {
+      case Seq(db) => Some(db)
+      case _ => None
+    }
+    val views = sessionCatalog.listTempViews(db.get, pattern)
+    views.map { viewIdent =>
+      val database = viewIdent.database.getOrElse("")
+      val tableName = viewIdent.table
+      val isTemp = sessionCatalog.isTempView(viewIdent)
+      val view = sessionCatalog.getTempViewOrPermanentTableMetadata(viewIdent)
+      val information = view.simpleString
+      rows += toCatalystRow(database, tableName, isTemp, s"$information\n")
+    }
+
+    rows.toSeq
+  }
+
+  private def getTableDetails(
+      catalogName: String,
+      identifier: Identifier,
+      table: Table): String = {
+    val results = new mutable.LinkedHashMap[String, String]()
+
+    results.put("Catalog", catalogName)
+    results.put("Namespace", identifier.namespace().quoted)
+    results.put("Table", identifier.name())
+    val tableType = if (table.properties().containsKey(TableCatalog.PROP_EXTERNAL)) {
+      CatalogTableType.EXTERNAL
+    } else {
+      CatalogTableType.MANAGED
+    }
+    results.put("Type", tableType.name)
+
+    CatalogV2Util.TABLE_RESERVED_PROPERTIES
+      .filterNot(_ == TableCatalog.PROP_EXTERNAL)
+      .foreach(propKey => {
+        if (table.properties.containsKey(propKey)) {
+          results.put(propKey.capitalize, table.properties.get(propKey))
+        }
+      })
+
+    val properties =
+      conf.redactOptions(table.properties.asScala.toMap).toList
+        .filter(kv => !CatalogV2Util.TABLE_RESERVED_PROPERTIES.contains(kv._1))
+        .sortBy(_._1).map {
+          case (key, value) => key + "=" + value
+        }.mkString("[", ",", "]")
+    if (table.properties().isEmpty) {
+      results.put("Table Properties", properties.mkString("[", ", ", "]"))
+    }
+
+    // Partition
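The quoted TODO asks for a pattern-aware listing API. A hypothetical sketch of what such an overload could look like; TableCatalog has no such method today, and the trait name and signature below are invented for illustration only:

```scala
// Hypothetical only: would let SHOW TABLE EXTENDED push the LIKE pattern
// down to the catalog instead of filtering client-side with
// StringUtils.filterPattern after a full listTables scan.
trait TableCatalogWithPatternListing extends TableCatalog {
  def listTables(namespace: Array[String], pattern: String): Array[Identifier]
}
```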
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380283326

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##

(Quoted context: the same ShowTablesSuiteBase.scala hunk as above; the comment attaches to these lines.)

+      sql(s"CREATE TABLE $tbl (id1 bigint, id2 bigint, data string) " +
+        s"$defaultUsing PARTITIONED BY (id1, id2)")

Review Comment:
   nit: I think tests in the base suite should always partition by the trailing columns. We can add a new simple test in the v1 suite to prove that partition columns always end up at the end of the schema, and one in the v2 suite to prove that we respect the original table schema. I think the v1 behavior is probably a bug, but we may never fix it now that it has become a feature :(
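A sketch of the v1-side test the comment asks for; the test name and assertion are assumptions, while withNamespaceAndTable and defaultUsing come from the suite:

```scala
// Illustrative sketch: v1 (Hive-style) catalogs relocate partition columns
// to the end of the table schema, regardless of their declared position.
test("partition columns are moved to the end of the schema") {
  withNamespaceAndTable("ns1", "tbl", catalog) { tbl =>
    sql(s"CREATE TABLE $tbl (id1 bigint, data string, id2 bigint) " +
      s"$defaultUsing PARTITIONED BY (id1)")
    // v1 expectation: id1 is relocated after the data columns.
    assert(spark.table(tbl).schema.fieldNames === Seq("data", "id2", "id1"))
  }
}
```

The matching v2 test would assert the declared order, `Seq("id1", "data", "id2")`, since v2 catalogs keep the original table schema.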
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380247392

## sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala: ##
@@ -2483,7 +2477,7 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat
       messageParameters = Map(
         "specKeys" -> specKeys,
         "partitionColumnNames" -> partitionColumnNames.mkString(", "),
-        "tableName" -> tableName))
+        "tableName" -> toSQLId(tableName)))

Review Comment:
   This change is necessary: it helps unify the v1/v2 command behavior, which is an important goal of adding new v2 commands.
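For context, a minimal sketch of what the quoting accomplishes; quoteIdentifier is the existing helper in org.apache.spark.sql.catalyst.util (it is also imported in the ShowTablesExtendedExec draft above), and the example values are made up:

```scala
import org.apache.spark.sql.catalyst.util.quoteIdentifier

// Backtick-quote each name part: the form v2 command errors already emit.
val parts = Seq("spark_catalog", "ns1", "tbl")
val quoted = parts.map(quoteIdentifier).mkString(".")
// quoted == "`spark_catalog`.`ns1`.`tbl`"
```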
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380278877

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##

(Quoted context: the same ShowTablesSuiteBase.scala hunk as above; the comment attaches to this line.)

+  test("show table extended in a not existing partition") {

Review Comment:
   ```suggestion
     test("show table extended with a not existing partition") {
   ```
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380278496

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##

(Quoted context: the same ShowTablesSuiteBase.scala hunk as above; the comment attaches to this line.)

+  test("show table extended in a not existing table") {

Review Comment:
   ```suggestion
     test("show table extended with no matching table") {
   ```
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380276558

## sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowTablesSuiteBase.scala: ##
@@ -40,6 +40,38 @@ trait ShowTablesSuiteBase extends QueryTest with DDLCommandTestUtils {
     checkAnswer(df, expected)
   }

+  // the error class & error parameters of
+  // `SHOW TABLE EXTENDED ... PARTITION ... in non-partitioned table`
+  protected def extendedPartInNonPartedTableError(
+      catalog: String,
+      namespace: String,
+      table: String): (String, Map[String, String])
+
+  protected def extendedPartExpectedResult: String =
+    "Partition Values: [id1=1, id2=2]"
+
+  protected def namespaceKey: String = "Database"
+
+  protected def extendedTableExpectedResultDiff: String
+
+  private def extendedTableExpectedResult(
+      catalog: String,
+      namespaceName: String,
+      namespace: String,
+      table: String,
+      partColName: String,
+      dataColName: String): String = {
+    s"""Catalog: $catalog
+       |$namespaceName: $namespace
+       |Table: $table
+       |$extendedTableExpectedResultDiff

Review Comment:
   This name is a bit confusing. Do you mean `extendedTableInfo`?
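For clarity, a sketch of the rename being floated; only the name changes, and the member stays abstract:

```scala
// Sketch of the suggested rename: the member supplies the catalog-specific
// slice of the expected "information" output, so the name should say so.
protected def extendedTableInfo: String
```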
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380266348

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala: ##

(Quoted context: the same ShowTablesExtendedExec.scala hunk as above.)
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380265564

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExtendedExec.scala: ##

(Quoted context: the same ShowTablesExtendedExec.scala hunk as above.)
Re: [PR] [SPARK-33393][SQL] Support SHOW TABLE EXTENDED in v2 [spark]
cloud-fan commented on code in PR #37588: URL: https://github.com/apache/spark/pull/37588#discussion_r1380269913

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala: ##
@@ -1090,6 +1090,22 @@ class SessionCatalog(
     dbViews ++ listLocalTempViews(pattern)
   }

+  /**
+   * List all matching temp views in the specified database, including global/local temporary views.
+   */
+  def listTempViews(db: String, pattern: String): Seq[TableIdentifier] = {

Review Comment:
   Shall we just return `Seq[CatalogTable]`?
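A sketch of what the proposed return type would enable on the caller side; the surrounding names (rows, toCatalystRow, sessionCatalog, db, pattern) come from the quoted ShowTablesExtendedExec draft above, and the exact body is illustrative:

```scala
// Sketch of the proposed signature: returning the full CatalogTable removes
// the second lookup via getTempViewOrPermanentTableMetadata.
def listTempViews(db: String, pattern: String): Seq[CatalogTable]

// The temp-view loop in ShowTablesExtendedExec would then reduce to roughly:
sessionCatalog.listTempViews(db, pattern).foreach { view =>
  rows += toCatalystRow(
    view.identifier.database.getOrElse(""),
    view.identifier.table,
    true, // every entry returned here is a temporary view
    s"${view.simpleString}\n")
}
```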