[GitHub] [spark] MaxGekk closed pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
MaxGekk closed pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types URL: https://github.com/apache/spark/pull/40074 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] MaxGekk commented on pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
MaxGekk commented on PR #40074: URL: https://github.com/apache/spark/pull/40074#issuecomment-1435512116 +1, LGTM. Merging to master/3.4. Thank you, @gengliangwang.
[GitHub] [spark] MaxGekk commented on pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
MaxGekk commented on PR #40074: URL: https://github.com/apache/spark/pull/40074#issuecomment-1435511800 The Kubernetes test failure is not related to this PR, I believe:

```
[info] org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED *** (27 minutes, 35 seconds)
[info]   io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.49.2:8443/api/v1/namespaces. Message: object is being deleted: namespaces "spark-e7de0ffd81044f09afb2693a0e227a43" already exists. Received status:
```
[GitHub] [spark] MaxGekk closed pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE
MaxGekk closed pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE URL: https://github.com/apache/spark/pull/39910
[GitHub] [spark] MaxGekk commented on pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE
MaxGekk commented on PR #39910: URL: https://github.com/apache/spark/pull/39910#issuecomment-1435510023 +1, LGTM. Merging to master/3.4. Thank you, @allisonwang-db.
[GitHub] [spark] MaxGekk commented on pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE
MaxGekk commented on PR #39910: URL: https://github.com/apache/spark/pull/39910#issuecomment-1435509673 The Kubernetes test failure is not related to this PR, I think:

```
[info] - SPARK-38187: Run SparkPi Jobs with minCPU *** FAILED *** (3 minutes, 1 second)
[info]   The code passed to eventually never returned normally. Attempted 189 times over 3.0106043718 minutes. Last failure message: 0 did not equal 2. (VolcanoTestsSuite.scala:302)
```
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1110773664

## core/src/main/scala/org/apache/spark/storage/BlockManager.scala:

```diff
@@ -1325,31 +1328,71 @@ private[spark] class BlockManager(
     blockInfoManager.releaseAllLocksForTask(taskAttemptId)
   }
 
+  /**
+   * Retrieve the given rdd block if it exists and is visible, otherwise call the provided
+   * `makeIterator` method to compute the block, persist it, and return its values.
+   *
+   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
+   *         could not be cached.
+   */
+  def getOrElseUpdateRDDBlock[T](
+      taskId: Long,
+      blockId: RDDBlockId,
+      level: StorageLevel,
+      classTag: ClassTag[T],
+      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
+    val isCacheVisible = isRDDBlockVisible(blockId)
+    val res = getOrElseUpdate(blockId, level, classTag, makeIterator, isCacheVisible)
+    if (res.isLeft && !isCacheVisible) {
+      // Block exists but not visible, report taskId -> blockId info to master.
+      master.updateRDDBlockTaskInfo(blockId, taskId)
+    }
+
+    res
+  }
+
   /**
    * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
    * to compute the block, persist it, and return its values.
    *
    * @return either a BlockResult if the block was successfully cached, or an iterator if the block
    *         could not be cached.
    */
-  def getOrElseUpdate[T](
+  private def getOrElseUpdate[T](
       blockId: BlockId,
       level: StorageLevel,
       classTag: ClassTag[T],
-      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
-    // Attempt to read the block from local or remote storage. If it's present, then we don't need
-    // to go through the local-get-or-put path.
-    get[T](blockId)(classTag) match {
-      case Some(block) =>
-        return Left(block)
-      case _ =>
-        // Need to compute the block.
+      makeIterator: () => Iterator[T],
+      isCacheVisible: Boolean = true): Either[BlockResult, Iterator[T]] = {
+    // Track whether the data is computed or not, force to do the computation later if need to.
+    // The reason we push the force computing later is that once the executor is decommissioned we
+    // will have a better chance to replicate the cache block because of the `checkShouldStore`
+    // validation when putting a new block.
+    var computed: Boolean = false
+    val iterator = () => {
+      computed = true
+      makeIterator()
+    }
+    if (isCacheVisible) {
+      // Attempt to read the block from local or remote storage. If it's present, then we don't need
+      // to go through the local-get-or-put path.
+      get[T](blockId)(classTag) match {
+        case Some(block) =>
+          return Left(block)
+        case _ =>
+          // Need to compute the block.
+      }
     }
+
     // Initially we hold no locks on this block.
-    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
+    doPutIterator(blockId, iterator, level, classTag, keepReadLock = true) match {
       case None =>
         // doPut() didn't hand work back to us, so the block already existed or was successfully
         // stored. Therefore, we now hold a read lock on the block.
+        if (!isCacheVisible && !computed) {
+          // Force compute to report accumulator updates.
```

Review Comment: Yes, the recomputation is only for updating accumulators. The result should be the same unless the result is indeterminate.
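The control flow under review can be sketched outside Spark. The following is a hedged Python illustration (the function name, the dict-backed cache, and the iterator protocol are stand-ins, not Spark's API) of the pattern in the diff: wrap the compute function so the caller can tell whether it actually ran, and, if the block was already cached but is not yet visible, force one recomputation purely so side effects such as accumulator updates are still reported.

```python
def get_or_else_update(cache, block_id, make_iterator, is_cache_visible):
    """Sketch of the cache-visibility handling discussed in the review.

    `cache` is a plain dict standing in for block storage; `make_iterator`
    stands in for the task's computation, whose side effects (accumulator
    updates) must happen at least once per task attempt.
    """
    computed = False

    def iterator():
        # Record that the computation actually ran.
        nonlocal computed
        computed = True
        return make_iterator()

    # Fast path only when the cached block is visible.
    if is_cache_visible and block_id in cache:
        return cache[block_id]

    # Compute and persist if the block does not exist yet.
    if block_id not in cache:
        cache[block_id] = list(iterator())

    # The block existed but was not visible and we never ran the
    # computation: force it just to report accumulator updates.
    if not is_cache_visible and not computed:
        for _ in make_iterator():
            pass

    return cache[block_id]
```

As the review reply notes, the forced recomputation should produce the same values as the cached block unless the computation is indeterminate; its only purpose here is the side effects.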
[GitHub] [spark] dongjoon-hyun closed pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun closed pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results URL: https://github.com/apache/spark/pull/40072
[GitHub] [spark] dongjoon-hyun commented on pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on PR #40072: URL: https://github.com/apache/spark/pull/40072#issuecomment-1435501963 Thank you so much, as always, for your help, @viirya! Merged to master for Apache Spark 3.5.
[GitHub] [spark] viirya commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
viirya commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110703022

## sql/core/benchmarks/DataSourceReadBenchmark-results.txt:

```diff
@@ -2,430 +2,430 @@ SQL Single Numeric Column Scan
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single BOOLEAN Column Scan:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 --------------------------------------------------------------------------------------------------------------
-SQL CSV                                      10433         10554        172        1.5        663.3      1.0X
-SQL Json                                      7948          7990         60        2.0        505.3      1.3X
-SQL Parquet Vectorized: DataPageV1             126           149         22      125.2          8.0     83.0X
-SQL Parquet Vectorized: DataPageV2              99           113         17      158.6          6.3    105.2X
-SQL Parquet MR: DataPageV1                    1777          1784          9        8.8        113.0      5.9X
-SQL Parquet MR: DataPageV2                    1579          1583          6       10.0        100.4      6.6X
-SQL ORC Vectorized                             158           165          5       99.7         10.0     66.1X
-SQL ORC MR                                    1654          1661          9        9.5        105.2      6.3X
-
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+SQL CSV                                      13143         13363        311        1.2        835.6      1.0X
```

Review Comment: Hmm, it's significant.
[GitHub] [spark] dongjoon-hyun commented on pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module
dongjoon-hyun commented on PR #40053: URL: https://github.com/apache/spark/pull/40053#issuecomment-1435487106 Thank you, @LuciferYang and @huaxingao!
[GitHub] [spark] dongjoon-hyun commented on pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on PR #40072: URL: https://github.com/apache/spark/pull/40072#issuecomment-1435486887 When you have some time, could you review this, @viirya? I want to merge this to proceed with the further investigations.
[GitHub] [spark] huaxingao commented on pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module
huaxingao commented on PR #40053: URL: https://github.com/apache/spark/pull/40053#issuecomment-1435483886 Merged to master. Thanks @LuciferYang
[GitHub] [spark] huaxingao closed pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module
huaxingao closed pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module URL: https://github.com/apache/spark/pull/40053
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435482433 Got it. Thank you for the information.
[GitHub] [spark] LuciferYang commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
LuciferYang commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110563308

## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala:

```diff
@@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField,
 import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object PartitioningUtils {
+  private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r
```

Review Comment: I remember seeing similar cases in the production environment, but I can't remember the details. We need tests covering the corner cases we can think of. cc @wangyum @sunchao FYI
[GitHub] [spark] ozhembr opened a new pull request, #40077: [SPIP][POC] Driver scaling: parallel schedulers
ozhembr opened a new pull request, #40077: URL: https://github.com/apache/spark/pull/40077

### What changes were proposed in this pull request?
SPIP: https://docs.google.com/document/d/1_MVEpGxz6U_CNqKArR1M1l2oP-3I7O67grfwPtniLaA/edit?usp=sharing POC of scaling the Spark driver via parallel schedulers. Uses multiple groups of `CoarseGrainedSchedulerBackend` and `TaskSchedulerImpl`.

### Why are the changes needed?
Low performance of the Spark driver with multiple large jobs.

### Does this PR introduce _any_ user-facing change?
Config for enabling parallel schedulers: `spark.driver.schedulers.parallelism` - the number of parallel schedulers; leaving it unset or setting it to <= 1 disables parallelism.

### How was this patch tested?
Comparison tests with spark-sql processes at the same parallelism level.
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435477929 @dongjoon-hyun found a new issue related to 2.7.5: https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/284
[GitHub] [spark] wecharyu commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
wecharyu commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110559214

## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala:

```diff
@@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField,
 import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object PartitioningUtils {
+  private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r
```

Review Comment: @LuciferYang Thanks for your review. The partition name always follows this rule in Hive's [makePartName](https://github.com/apache/hive/blob/1747bc7b8c58d84f455c95f26f1183a65246f331/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java#L571). The partition name depends only on the partition keys and values; other partition fields such as the location do not affect it.
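The rule under discussion can be checked with a quick sketch. The following is a hedged, illustrative example (not Spark's implementation) of matching Hive-style partition names of the form `key1=val1/key2=val2` against the `(.+)=(.+)` pattern from the review, including one of the corner cases the thread asks to be tested: a segment that contains more than one `=`.

```python
import re

# The pattern from the review thread; note both groups are greedy.
PATTERN_FOR_KEY_EQ_VAL = re.compile(r"(.+)=(.+)")

def parse_partition_name(name):
    """Split a Hive-style partition name into (key, value) pairs."""
    pairs = []
    for segment in name.split("/"):
        m = PATTERN_FOR_KEY_EQ_VAL.match(segment)
        if m is None:
            raise ValueError(f"not a key=value segment: {segment!r}")
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

Because the first group is greedy, a segment like `k=a=b` parses as key `k=a` and value `b`; whether that agrees with how Hive escapes special characters when building partition names is exactly the kind of corner case the review suggests covering with tests.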
[GitHub] [spark] jchen5 commented on a diff in pull request #39759: [SPARK-36124][SQL] Support subqueries with correlation through INTERSECT/EXCEPT
jchen5 commented on code in PR #39759: URL: https://github.com/apache/spark/pull/39759#discussion_r1110500495

## sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:

```diff
@@ -1120,6 +1258,105 @@ struct
 1	2	3
 
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    EXCEPT DISTINCT
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  UNION DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   INTERSECT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+0	1	1
+0	1	2
+0	1	3
+1	2	1
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    UNION ALL
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  INTERSECT DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   EXCEPT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct
+-- !query output
+0	1	0
+1	2	0
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4
+   WHERE t4.c1 > t1.c2) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED",
```

Review Comment: Yes, expected. This case is a semi join with correlation on the right side, which is unsupported. (It's noted in a comment in the query input file; is there a way to get those comments to show up in the output file too, for easier visibility?)
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110497257

## sql/core/benchmarks/DataSourceReadBenchmark-results.txt:

```diff
@@ -2,430 +2,430 @@ SQL Single Numeric Column Scan
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single BOOLEAN Column Scan:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 --------------------------------------------------------------------------------------------------------------
-SQL CSV                                      10433         10554        172        1.5        663.3      1.0X
-SQL Json                                      7948          7990         60        2.0        505.3      1.3X
-SQL Parquet Vectorized: DataPageV1             126           149         22      125.2          8.0     83.0X
-SQL Parquet Vectorized: DataPageV2              99           113         17      158.6          6.3    105.2X
-SQL Parquet MR: DataPageV1                    1777          1784          9        8.8        113.0      5.9X
-SQL Parquet MR: DataPageV2                    1579          1583          6       10.0        100.4      6.6X
-SQL ORC Vectorized                             158           165          5       99.7         10.0     66.1X
-SQL ORC MR                                    1654          1661          9        9.5        105.2      6.3X
-
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+SQL CSV                                      13143         13363        311        1.2        835.6      1.0X
```

Review Comment: CSV seems to become 30% slower.
[GitHub] [spark] allisonwang-db commented on a diff in pull request #39759: [SPARK-36124][SQL] Support subqueries with correlation through INTERSECT/EXCEPT
allisonwang-db commented on code in PR #39759: URL: https://github.com/apache/spark/pull/39759#discussion_r1110473587

## sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:

```diff
@@ -1120,6 +1258,105 @@ struct
 1	2	3
 
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    EXCEPT DISTINCT
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  UNION DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   INTERSECT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+0	1	1
+0	1	2
+0	1	3
+1	2	1
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    UNION ALL
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  INTERSECT DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   EXCEPT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct
+-- !query output
+0	1	0
+1	2	0
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4
+   WHERE t4.c1 > t1.c2) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED",
```

Review Comment: Is this expected?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110473324

## sql/core/benchmarks/SortBenchmark-jdk17-results.txt:

```diff
@@ -2,15 +2,15 @@ radix sort
-OpenJDK 64-Bit Server VM 17.0.5+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 17.0.6+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 radix sort 2500:                    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 -------------------------------------------------------------------------------------------------------------
-reference TimSort key prefix array          12059         12071         16        2.1        482.4      1.0X
-reference Arrays.sort                        2864          2887         33        8.7        114.5      4.2X
-radix sort one byte                           197           203          8      126.8          7.9     61.1X
-radix sort two bytes                          373           375          2       66.9         14.9     32.3X
-radix sort eight bytes                       1415          1417          4       17.7         56.6      8.5X
-radix sort key prefix array                  1930          1966         51       13.0         77.2      6.2X
+reference TimSort key prefix array          12111         12128         23        2.1        484.4      1.0X
+reference Arrays.sort                        2861          2885         35        8.7        114.4      4.2X
+radix sort one byte                           197           197          0      127.0          7.9     61.5X
+radix sort two bytes                          371           372          0       67.4         14.8     32.6X
+radix sort eight bytes                       1391          1397          8       18.0         55.7      8.7X
+radix sort key prefix array                  1914          1951         52       13.1         76.6      6.3X
```

Review Comment: In this benchmark, all Java 17 results are faster than Java 8.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110471680

## sql/core/benchmarks/TPCDSQueryBenchmark-jdk11-results.txt: ##

```diff
@@ -1,810 +1,810 @@
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 TPCDS Snappy:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-q1                          1772          1905        188        0.3       3841.1      1.0X
+q1                          1888          2074        263        0.2       4092.0      1.0X

-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 TPCDS Snappy:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-q2                          1686          1696         15        1.3        755.2      1.0X
+q2                          1585          1899        444        1.4        710.1      1.0X

-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 TPCDS Snappy:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-q3                           718           759         41        4.1        241.8      1.0X
+q3                           996          1035         55        3.0        335.3      1.0X
```

Review Comment: Maybe, slower?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110470216

## sql/core/benchmarks/UpdateFieldsBenchmark-results.txt: ##

```diff
@@ -2,25 +2,25 @@ Add 2 columns and drop 2 columns at 3 different depths of nesting
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Add 2 columns and drop 2 columns at 3 different depths of nesting:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-To non-nullable StructTypes using performant method              4     6     3  0.0  Infinity  1.0X
-To nullable StructTypes using performant method                  3     4     1  0.0  Infinity  1.3X
-To non-nullable StructTypes using non-performant method         54    63     5  0.0  Infinity  0.1X
-To nullable StructTypes using non-performant method           2002  2091   127  0.0  Infinity  0.0X
+To non-nullable StructTypes using performant method              6     8     3  0.0  Infinity  1.0X
+To nullable StructTypes using performant method                  4     5     2  0.0  Infinity  1.4X
+To non-nullable StructTypes using non-performant method         68    73     5  0.0  Infinity  0.1X
+To nullable StructTypes using non-performant method           2223  2452   324  0.0  Infinity  0.0X

 Add 50 columns and drop 50 columns at 100 different depths of nesting
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Add 50 columns and drop 50 columns at 100 different depths of nesting:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-To non-nullable StructTypes using performant method           5520  5639   168  0.0  Infinity  1.0X
-To nullable StructTypes using performant method               2657  2708    72  0.0  Infinity  2.1X
+To non-nullable StructTypes using performant method           3126  3150    34  0.0  Infinity  1.0X
+To nullable StructTypes using performant method               3136  4768  2309  0.0  Infinity  1.0X
```

Review Comment: This looks like a regression in Java 8. We need to take a look at this later.
[GitHub] [spark] zhenlineo commented on a diff in pull request #40061: [SPARK-42482][CONNECT] Scala Client Write API V1
zhenlineo commented on code in PR #40061: URL: https://github.com/apache/spark/pull/40061#discussion_r1110470096

## connector/connect/client/jvm/src/main/java/org/apache/spark/sql/SaveMode.java: ##

```diff
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql;
+
+import org.apache.spark.annotation.Stable;
+
+/**
+ * SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
+ *
+ * @since 1.3.0
+ */
+@Stable
+public enum SaveMode {
```

Review Comment: I checked the MiMaExcludes file and found that if we want to break spark-sql backward compatibility, we need a very good reason to do so. So I left this out of this PR.
[GitHub] [spark] zhenlineo commented on a diff in pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
zhenlineo commented on code in PR #40075: URL: https://github.com/apache/spark/pull/40075#discussion_r1110468795

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala: ##

```diff
@@ -156,6 +156,19 @@ class ClientE2ETestSuite extends RemoteSparkSession {
     }
   }

+  test("write v2") {
+    try {
+      spark.range(3).writeTo("myTableV2").using("parquet").create()
```

Review Comment: This API feels wrong; a few errors I get with this API:

```
spark.range(3).writeTo("myTableV2").create()
-> Error: should specify source
```

```
spark.range(3).writeTo("myTableV2").using("parquet").createOrReplace()
-> Error: does not support create or select.
```

```
spark.range(3).writeTo("myTableV2").using("parquet").create()
spark.range(3).writeTo("myTableV2").append()
-> Error: Cannot write into v1 table
```
[GitHub] [spark] srowen commented on pull request #39967: [SPARK-42395][K8S]The code logic of the configmap max size validation lacks extra content
srowen commented on PR #39967: URL: https://github.com/apache/spark/pull/39967#issuecomment-1435430725

I don't know enough about K8S to review this. Seems harmless.
[GitHub] [spark] ninebigbig commented on pull request #39967: [SPARK-42395][K8S]The code logic of the configmap max size validation lacks extra content
ninebigbig commented on PR #39967: URL: https://github.com/apache/spark/pull/39967#issuecomment-1435430134

Could I take some of your time to help review this PR, please? @dongjoon-hyun @HyukjinKwon @srowen @LuciferYang
[GitHub] [spark] ueshin opened a new pull request, #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals
ueshin opened a new pull request, #40076: URL: https://github.com/apache/spark/pull/40076

### What changes were proposed in this pull request?

Fixes the alias name for numpy literals. Also fixes `F.lit` in Spark Connect to support `np.bool_` objects.

### Why are the changes needed?

Currently the alias name for literals created from numpy scalars contains something like `CAST(` ... `AS )`, but it should be removed, returning only the value string, the same as literals from Python numbers.

### Does this PR introduce _any_ user-facing change?

The alias name will be changed.

### How was this patch tested?

Modified/enabled related tests.
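As background (not part of the PR itself), a value-only alias is possible because numpy scalars unwrap to ordinary Python values; a minimal sketch of that conversion, using only numpy:

```python
import numpy as np

# .item() converts any numpy scalar (including np.bool_) to the plain
# Python value, which is what a value-only alias string requires.
for v in (np.int64(7), np.float64(2.5), np.bool_(True)):
    print(type(v.item()).__name__, v.item())
```

How Spark Connect actually builds the alias string is internal to the PR; the sketch only shows that no `CAST` wrapper is inherent to numpy inputs.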
[GitHub] [spark] zhenlineo opened a new pull request, #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
zhenlineo opened a new pull request, #40075: URL: https://github.com/apache/spark/pull/40075

### What changes were proposed in this pull request?

Adds DataFrameWriterV2.

### Why are the changes needed?

Implements Dataset#writeTo.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

E2E tests.

This is based on https://github.com/apache/spark/pull/40061
[GitHub] [spark] hvanhovell closed pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell closed pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum} URL: https://github.com/apache/spark/pull/40070
[GitHub] [spark] hvanhovell commented on pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on PR #40070: URL: https://github.com/apache/spark/pull/40070#issuecomment-1435428773

Merging.
[GitHub] [spark] gengliangwang opened a new pull request, #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
gengliangwang opened a new pull request, #40074: URL: https://github.com/apache/spark/pull/40074

### What changes were proposed in this pull request?

As https://github.com/apache/spark/pull/40005#pullrequestreview-1299089504 pointed out, the java doc for data types recommends using the factory methods provided in org.apache.spark.sql.types.DataTypes. Since the ANSI interval types missed the `DataTypes` mention as well, this PR also revises their docs.

### Why are the changes needed?

Unify the data type docs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Local preview: https://user-images.githubusercontent.com/1097932/219821685-321c2fd1-6248-4930-9c61-eec68f0dcb50.png
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110464359

## sql/catalyst/benchmarks/HashBenchmark-jdk11-results.txt: ##

```diff
@@ -2,69 +2,69 @@ single ints
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Hash For single ints:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-interpreted version                 3763          3769          8      142.7          7.0      1.0X
-codegen version                     4658          4662          5      115.3          8.7      0.8X
-codegen version 64-bit              4706          4710          6      114.1          8.8      0.8X
-codegen HiveHash version            3998          3998          0      134.3          7.4      0.9X
+interpreted version                 4933          4935          2      108.8          9.2      1.0X
+codegen version                     5135          5141          9      104.6          9.6      1.0X
+codegen version 64-bit              5071          5079         10      105.9          9.4      1.0X
+codegen HiveHash version            4326          4326          0      124.1          8.1      1.1X
```

Review Comment: Now, this is the fastest.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110463604

## sql/catalyst/benchmarks/EnumTypeSetBenchmark-jdk11-results.txt: ##

```diff
@@ -1,105 +1,105 @@
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 Test contains use empty Set:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Use HashSet                               4             4          0      226.9          4.4      1.0X
-Use EnumSet                               1             1          0      737.3          1.4      3.2X
+Use HashSet                               0             1          0     2440.2          0.4      1.0X
+Use EnumSet                               1             1          0      884.8          1.1      0.4X
```

Review Comment: `HashSet` seems to get some improvements in this case, `contains use empty Set:`. The other cases look to be in a reasonable range.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110463981

## sql/catalyst/benchmarks/EnumTypeSetBenchmark-results.txt: ##

```diff
@@ -1,105 +1,105 @@
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 Test contains use empty Set:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Use HashSet                               5             5          0      209.4          4.8      1.0X
-Use EnumSet                               2             2          0      459.8          2.2      2.2X
+Use HashSet                               1             1          1     1972.0          0.5      1.0X
+Use EnumSet                               2             2          0      444.0          2.3      0.2X
```

Review Comment: ditto.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110461118

## sql/catalyst/benchmarks/EnumTypeSetBenchmark-jdk11-results.txt: ##

```diff
@@ -1,105 +1,105 @@
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 Test contains use empty Set:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Use HashSet                               4             4          0      226.9          4.4      1.0X
-Use EnumSet                               1             1          0      737.3          1.4      3.2X
+Use HashSet                               0             1          0     2440.2          0.4      1.0X
+Use EnumSet                               1             1          0      884.8          1.1      0.4X
```

Review Comment: We need to investigate this reversed ratio.
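For reference, the reversed ratio follows directly from how the Relative column is computed: it is the baseline time divided by the candidate time, with the first row serving as the 1.0X baseline. A quick sketch using the per-row times (ns) from the new jdk11 run:

```python
# Relative = baseline_time / candidate_time; the first row is the baseline.
per_row_ns = {"Use HashSet": 0.4, "Use EnumSet": 1.1}  # new jdk11 run
baseline = per_row_ns["Use HashSet"]
for name, t in per_row_ns.items():
    print(f"{name}: {baseline / t:.1f}X")
```

So once `HashSet` became faster than `EnumSet`, `EnumSet` necessarily dropped below 1.0X even though its own absolute time barely changed.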
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110460334

## core/benchmarks/ZStandardBenchmark-jdk11-results.txt: ##

```diff
@@ -2,26 +2,26 @@ Benchmark ZStandardCompressionCodec
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
 Benchmark ZStandardCompressionCodec:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Compression 1 times at level 1 without buffer pool    859   872   21  0.0   85890.3  1.0X
-Compression 1 times at level 2 without buffer pool    930   932    2  0.0   92995.6  0.9X
-Compression 1 times at level 3 without buffer pool   1137  1138    2  0.0  113664.6  0.8X
-Compression 1 times at level 1 with buffer pool       662   664    1  0.0   66244.7  1.3X
-Compression 1 times at level 2 with buffer pool       725   726    1  0.0   72541.4  1.2X
-Compression 1 times at level 3 with buffer pool       929   930    2  0.0   92851.4  0.9X
+Compression 1 times at level 1 without buffer pool    605   812  220  0.0   60521.0  1.0X
+Compression 1 times at level 2 without buffer pool    665   678   20  0.0   66512.5  0.9X
+Compression 1 times at level 3 without buffer pool    890   903   20  0.0   88961.3  0.7X
+Compression 1 times at level 1 with buffer pool       829   839   11  0.0   82940.2  0.7X
```

Review Comment: Java 8/17 doesn't have this regression.
[GitHub] [spark] github-actions[bot] commented on pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
github-actions[bot] commented on PR #38263: URL: https://github.com/apache/spark/pull/38263#issuecomment-1435417625

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #38505: [SPARK-40622][WIP]do not merge(try to fix build error)
github-actions[bot] commented on PR #38505: URL: https://github.com/apache/spark/pull/38505#issuecomment-1435417586

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #37362: [SPARK-39950][SQL] It's unnecessary to materialize BroadcastQueryStage firstly, because the BroadcastQueryStage does not timeout in AQE
github-actions[bot] closed pull request #37362: [SPARK-39950][SQL] It's unnecessary to materialize BroadcastQueryStage firstly, because the BroadcastQueryStage does not timeout in AQE. URL: https://github.com/apache/spark/pull/37362
[GitHub] [spark] github-actions[bot] commented on pull request #38434: [SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDownClusterKeys
github-actions[bot] commented on PR #38434: URL: https://github.com/apache/spark/pull/38434#issuecomment-1435417608

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110460125

## core/benchmarks/ZStandardBenchmark-jdk11-results.txt: ##

```diff
@@ -2,26 +2,26 @@ Benchmark ZStandardCompressionCodec
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
 Benchmark ZStandardCompressionCodec:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Compression 1 times at level 1 without buffer pool    859   872   21  0.0   85890.3  1.0X
-Compression 1 times at level 2 without buffer pool    930   932    2  0.0   92995.6  0.9X
-Compression 1 times at level 3 without buffer pool   1137  1138    2  0.0  113664.6  0.8X
-Compression 1 times at level 1 with buffer pool       662   664    1  0.0   66244.7  1.3X
-Compression 1 times at level 2 with buffer pool       725   726    1  0.0   72541.4  1.2X
-Compression 1 times at level 3 with buffer pool       929   930    2  0.0   92851.4  0.9X
+Compression 1 times at level 1 without buffer pool    605   812  220  0.0   60521.0  1.0X
+Compression 1 times at level 2 without buffer pool    665   678   20  0.0   66512.5  0.9X
+Compression 1 times at level 3 without buffer pool    890   903   20  0.0   88961.3  0.7X
+Compression 1 times at level 1 with buffer pool       829   839   11  0.0   82940.2  0.7X
```

Review Comment: I'll take a look at this after this PR.
[GitHub] [spark] WweiL opened a new pull request, #40073: [SPARK-42484] UnsafeRowUtils better error message
WweiL opened a new pull request, #40073: URL: https://github.com/apache/spark/pull/40073

### What changes were proposed in this pull request?
Show the essential information when throwing `InvalidUnsafeRowException`, including where the check failed and the status of the `unsafeRow` and the `expectedSchema`. Example output:
```
[UnsafeRowStatus] expectedSchema: StructType(StructField(key1,IntegerType,false),StructField(key2,IntegerType,false),StructField(sum(key1),IntegerType,false),StructField(sum(key2),IntegerType,false)), expectedSchemaNumFields: 4, numFields: 4, bitSetWidthInBytes: 8, rowSizeInBytes: 40
fieldStatus:
[UnsafeRowFieldStatus] index: 0, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
[UnsafeRowFieldStatus] index: 1, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
[UnsafeRowFieldStatus] index: 2, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
[UnsafeRowFieldStatus] index: 3, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
```

### Why are the changes needed?
Right now, when such an error happens it is hard to track down where the check failed and what the misbehaving row and schema look like. With this change that information is explicit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests
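The error message in the PR description is a flat, key-value rendering of the row and per-field state. A minimal standalone sketch of how such a status string can be assembled is below; the class and field names mirror the example output, but this implementation is hypothetical and is not Spark's actual `UnsafeRowUtils` code:

```scala
// Hypothetical sketch: assemble a structured status string like the one in
// the PR's example output. Names mirror the example; the code is not Spark's.
case class FieldStatus(index: Int, expectedFieldType: String, isNull: Boolean,
                       isFixedLength: Boolean, offset: Int, size: Int) {
  override def toString: String =
    s"[UnsafeRowFieldStatus] index: $index, expectedFieldType: $expectedFieldType, " +
      s"isNull: $isNull, isFixedLength: $isFixedLength, offset: $offset, size: $size"
}

case class RowStatus(expectedSchema: String, numFields: Int,
                     bitSetWidthInBytes: Int, rowSizeInBytes: Int,
                     fields: Seq[FieldStatus]) {
  override def toString: String =
    s"[UnsafeRowStatus] expectedSchema: $expectedSchema, numFields: $numFields, " +
      s"bitSetWidthInBytes: $bitSetWidthInBytes, rowSizeInBytes: $rowSizeInBytes\n" +
      "fieldStatus:\n" + fields.mkString("\n")
}
```

Rendering everything into one string keeps the whole diagnostic visible in a single exception message, which is the point of the PR.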
[GitHub] [spark] dongjoon-hyun opened a new pull request, #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun opened a new pull request, #40072: URL: https://github.com/apache/spark/pull/40072

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110400229

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,7 +109,7 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToColumn(expr: String, inputExpr: proto.Expression): Column = {
```

Review Comment: I see what you are suggesting now. Done.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110388531

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -2065,6 +2065,15 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan:
     collectResult().iterator.asInstanceOf[java.util.Iterator[T]]
   }

+  /**
+   * Returns the number of rows in the Dataset.
+   * @group action
+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)
```

Review Comment: Well, you are right... I thought I added it.
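The `Dataset.count()` in the diff above piggybacks on the grouped `count()`: grouping by zero columns puts every row in a single group, so that group's `count` aggregate is the total row count. A plain-collections analogy of that idea (this is illustrative only, not Spark code):

```scala
// Plain-Scala analogy for Dataset.count() = groupBy().count():
// grouping by a constant key puts every element into one group,
// whose size is the total element count.
def countViaGroupBy[T](rows: Seq[T]): Long =
  rows
    .groupBy(_ => ())                              // one group for all rows
    .map { case (_, group) => group.size.toLong }  // per-group count
    .headOption
    .getOrElse(0L)                                 // empty input -> zero rows
```

In Spark the per-group count is computed distributedly and `collect().head.getLong(0)` pulls back the single resulting row.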
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110385162

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,7 +109,7 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToColumn(expr: String, inputExpr: proto.Expression): Column = {
```

Review Comment: How about:
```scala
private[this] def strToColumn(expr: String, inputExpr: Column): Column = {
  expr.toLowerCase(Locale.ROOT) match {
    case "avg" | "average" | "mean" => functions.avg(inputExpr)
    case "stddev" | "std" => functions.stddev(inputExpr)
    case "count" | "size" => functions.count(inputExpr) // Analyzer will take care of * expansion
    case name => Column.fn(name, inputExpr)
  }
}
```
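The core of the suggestion above is alias normalization: lowercase the user-supplied function name with `Locale.ROOT` (so matching is locale-independent) and fold the known aliases onto one canonical name before falling through to the function registry. That pattern can be shown standalone; the alias table below is a simplified assumption for illustration, not Spark's full registry:

```scala
import java.util.Locale

// Simplified, hypothetical alias table illustrating the matching pattern from
// the review suggestion. Locale.ROOT avoids locale-dependent lowercasing
// surprises (e.g. the Turkish dotless i). Unknown names fall through as-is,
// the way Spark defers them to its function registry.
def canonicalAggName(expr: String): String =
  expr.toLowerCase(Locale.ROOT) match {
    case "avg" | "average" | "mean" => "avg"
    case "stddev" | "std"           => "stddev"
    case "count" | "size"           => "count"
    case name                       => name
  }
```

With this shape, adding a new alias is a one-line change to the match, and everything that is not an alias takes the generic path.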
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110380640

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -149,4 +149,111 @@ class RelationalGroupedDataset protected[sql] (
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count")))
+
+  /**
+   * Compute the average value for each numeric columns for each group. This is an alias for
+   * `avg`. The resulting `DataFrame` will also contain the grouping columns. When specified
+   * columns are given, only compute the average values for them.
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def mean(colNames: String*): DataFrame = {
+    toDF(colNames.map(colName => functions.mean(colName)).toSeq)
```

Review Comment: hmmm I see. Removing those `toSeq`.
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435279944

If you don't mind, please allow me one or two days. I'll check this during the weekend~ Thank you for your patience always.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110378209

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
```

Review Comment: I think I did the right replacement and hit a proto -> plan test generation failure. I am planning to look into that separately. I will need some time to learn how to debug `org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite`.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110376126

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -149,4 +149,111 @@ class RelationalGroupedDataset protected[sql] (
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count")))
+
+  /**
+   * Compute the average value for each numeric columns for each group. This is an alias for
+   * `avg`. The resulting `DataFrame` will also contain the grouping columns. When specified
+   * columns are given, only compute the average values for them.
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def mean(colNames: String*): DataFrame = {
+    toDF(colNames.map(colName => functions.mean(colName)).toSeq)
```

Review Comment: do we need `toSeq` here? I thought Scala varargs are always a Seq...
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110375466

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -2065,6 +2065,15 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan:
     collectResult().iterator.asInstanceOf[java.util.Iterator[T]]
  }

+  /**
+   * Returns the number of rows in the Dataset.
+   * @group action
+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)
```

Review Comment: In my local branch, which I rebased today, this API does not exist.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110374201

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:

```diff
@@ -1199,6 +1199,34 @@ class PlanGenerationTestSuite extends ConnectFunSuite with BeforeAndAfterAll wit
       "b" -> "avg",
       "*" -> "size",
       "a" -> "count")
+
+    simple
```

Review Comment: done.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110374514

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -2065,6 +2065,15 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan:
     collectResult().iterator.asInstanceOf[java.util.Iterator[T]]
  }

+  /**
+   * Returns the number of rows in the Dataset.
+   * @group action
+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)
```

Review Comment: Didn't I implement that? lol...
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110372332

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
           .setIsDistinct(false)
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
```

Review Comment: Makes sense. Done.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110360552

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
           .setIsDistinct(false)
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c.expr
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))
```

Review Comment: I see what you mean now.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110359583

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
           .setIsDistinct(false)
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c.expr
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))
```

Review Comment: Wait, I think we should do it the reverse way, right? So `Dataset.count` = `groupBy().count().collect()`?
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110355264

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
```

Review Comment: See my earlier comment; also, the tests are broken.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110354619

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, plus these additions):

```
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c.expr
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))
```

Review Comment (on the `count()` definition): You can replace the dataframe generation code there.
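The special-casing under discussion in `strToExpr` boils down to normalizing user-supplied aggregate names to canonical function names before building the unresolved-function message, with `count(*)` rewritten to `count(1)`. A minimal, self-contained sketch of just that normalization logic (the `AggAlias` class and its helpers are illustrative, not Spark API):

```java
import java.util.Locale;
import java.util.Map;

public class AggAlias {
    // Aliases that are not in the function registry, mirroring the match arms
    // in the diff: avg/average/mean -> avg, stddev/std -> stddev,
    // count/size -> count.
    private static final Map<String, String> ALIASES = Map.of(
        "average", "avg",
        "mean", "avg",
        "std", "stddev",
        "size", "count");

    /** Lower-cases the name and resolves known aliases to their canonical form. */
    static String normalize(String fn) {
        String lower = fn.toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(lower, lower);
    }

    /** Rewrites the argument of count(*) to the literal "1", as the old code did. */
    static String countArgument(String column) {
        return "*".equals(column) ? "1" : column;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Mean"));  // avg
        System.out.println(normalize("std"));   // stddev
        System.out.println(normalize("max"));   // max (falls through to the registry)
        System.out.println(countArgument("*")); // 1
    }
}
```

Any name not in the alias table falls through unchanged, matching the `case name =>` arm that forwards to the function registry.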
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110352767

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:

```
@@ -1199,6 +1199,34 @@ class PlanGenerationTestSuite extends ConnectFunSuite with BeforeAndAfterAll wit
       "b" -> "avg",
       "*" -> "size",
       "a" -> "count")
+
+    simple
```

Review Comment: This doesn't work. You need to construct a single `Dataset` per test, now you are just returning the last one. Can you split this into a bunch of tests?
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110352041

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same hunks as above, on the line `def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))`):

Review Comment: That is different. `Dataset.count()` returns a long: `def count(): Long`
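The distinction amaliujia points out — `Dataset.count()` returning a single scalar versus the grouped `count()` returning one row per group — can be illustrated with plain Java collections (a toy model, no Spark involved):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountShapes {
    /** Like Dataset.count(): one scalar, i.e. SELECT COUNT(1) FROM df. */
    static long totalCount(List<String> departments) {
        return departments.size();
    }

    /**
     * Like groupBy("department").count(): one count per group, i.e.
     * SELECT department, COUNT(1) AS count FROM df GROUP BY department.
     */
    static Map<String, Long> perGroupCount(List<String> departments) {
        return departments.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> df = List.of("eng", "eng", "sales");
        System.out.println(totalCount(df));               // 3
        System.out.println(perGroupCount(df).get("eng")); // 2
    }
}
```

The different result shapes are why the two methods cannot share a signature: one returns `Long`, the other a `DataFrame` keyed by the grouping columns.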
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110349585

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, on the line `.addArguments(df(columnName).expr)`):

Review Comment: Hold on, I will revert this part. It seems to hit an issue somewhere when I switch to the functions API. I need to understand more of the functions implementation. I will debug this separately.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110350177

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same hunks as above, on the line `def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))`):

Review Comment: Can you replace `Dataset.count()` with this code?
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110349605

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same hunks as above, on the new `agg(expr: Column, exprs: Column*)` definition and its `toDF((expr +: exprs).map { case c => c.expr })` body):

Review Comment: TBH I would just make `toDF` take `Column`s instead of `Expression`s, that saves you from doing conversions everywhere.
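hvanhovell's suggestion — have `toDF` accept `Column`s and do the `.expr` unwrapping in one place — is a general API-design move: push a repeated conversion into the callee so call sites stay clean. A hypothetical sketch of the pattern (the `Column`, `Expression`, and `toDf` names are stand-ins, not the actual Spark Connect classes):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ToDfSketch {
    // Stand-ins for proto.Expression and the Column wrapper around it.
    record Expression(String repr) {}
    record Column(Expression expr) {}

    // Before: every caller had to map Column -> Expression itself,
    // scattering ".expr" conversions across the codebase.
    static List<Expression> toDfFromExprs(List<Expression> exprs) {
        return exprs;
    }

    // After: toDf takes Columns and unwraps them once, internally.
    static List<Expression> toDf(Column... cols) {
        return Arrays.stream(cols).map(Column::expr).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Column max = new Column(new Expression("max(age)"));
        Column sum = new Column(new Expression("sum(expense)"));
        System.out.println(toDf(max, sum).size()); // 2
    }
}
```

With the second shape, `agg` could simply forward its varargs without mapping over them first.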
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110347846

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, on the line `functions.avg(columnName)`):

Review Comment: You currently don't return this function, but the result of builder.build(). If you do, it should be `functions.avg(columnName).expr`.
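The bug hvanhovell flags is a classic one: the match arms now construct a value (`functions.avg(columnName)`) but the method still returns `builder.build()`, so the constructed value is silently discarded. A distilled illustration of the pattern (hypothetical names, not Spark code):

```java
public class DiscardedResult {
    /** Buggy shape: the branch computes a result, but the return ignores it. */
    static String buggyNormalize(String fn) {
        StringBuilder builder = new StringBuilder();
        if (fn.equals("mean")) {
            String computed = "avg"; // computed but never returned -- the bug
        } else {
            builder.append(fn);
        }
        return builder.toString(); // "" for "mean": the branch result was dropped
    }

    /** Fixed shape: the conditional itself is the returned expression. */
    static String fixedNormalize(String fn) {
        return fn.equals("mean") ? "avg" : fn;
    }

    public static void main(String[] args) {
        System.out.println(buggyNormalize("mean").isEmpty()); // true -- silently wrong
        System.out.println(fixedNormalize("mean"));           // avg
    }
}
```

In Scala the whole `match` is an expression, so the fix is to make every arm yield the proto expression (e.g. `functions.avg(columnName).expr`) and return the match result instead of the stale builder.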
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110348196

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, on the line `.addArguments(df(columnName).expr)`):

Review Comment: Use `Column.fn` instead?
[GitHub] [spark] ueshin opened a new pull request, #40071: [SPARK-41818][CONNECT][PYTHON][FOLLOWUP][TEST] Enable a doctest for DataFrame.write
ueshin opened a new pull request, #40071: URL: https://github.com/apache/spark/pull/40071

### What changes were proposed in this pull request?
Enables a doctest for `DataFrame.write`.

### Why are the changes needed?
Now that `DataFrame.write.saveAsTable` was fixed, we can enable the doctest.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enabled the doctest.
[GitHub] [spark] amaliujia commented on pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on PR #40070: URL: https://github.com/apache/spark/pull/40070#issuecomment-1435257024 I need to update tests in this PR.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110346693

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```
@@ -24,7 +24,7 @@ import scala.collection.JavaConverters._
 import org.apache.spark.connect.proto

 /**
- * A set of methods for aggregations on a `DataFrame`, created by [[Dataset#groupBy groupBy]],
+ * A set of methods for aggregations on a `DataFrame`, creted by [[Dataset#groupBy groupBy]],
```

Review Comment: Created?
[GitHub] [spark] amaliujia commented on pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on PR #40070: URL: https://github.com/apache/spark/pull/40070#issuecomment-1435241324 @hvanhovell
[GitHub] [spark] amaliujia opened a new pull request, #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia opened a new pull request, #40070: URL: https://github.com/apache/spark/pull/40070

### What changes were proposed in this pull request?
Adding more APIs to `agg`, including max, min, mean, count, avg, and sum.

### Why are the changes needed?
API coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT
[GitHub] [spark] dtenedor commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
dtenedor commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110299640

## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java (new file; standard Apache license header omitted):

```
+package org.apache.spark.sql.connector.catalog;
+
+import java.util.Objects;
+import javax.annotation.Nonnull;
+
+import org.apache.spark.sql.connector.expressions.Literal;
+
+/**
+ * A class representing the default value of a column. It contains both the SQL string and literal
+ * value of the user-specified default value expression. The SQL string should be re-evaluated for
+ * each table writing command, which may produce different values if the default value expression is
+ * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if
+ * new columns with default value are added. Note: the back-fill can be lazy. The data sources can
+ * remember the column default value and let the reader fill the column value when reading existing
+ * data that do not have these new columns.
+ */
+public class ColumnDefaultValue {
+  private String sql;
+  private Literal value;
```

Review Comment: Data source developers only have to think about the existence default value. For any column where the corresponding field is not present in storage, the data source is responsible for filling this in instead of NULL. On the other hand, the current default value is for DML only. The analyzer inserts this expression for any explicit reference to `DEFAULT`, or for a small subset of implicit cases. For these fields we could clarify with comments, e.g.

```
// This is the original string contents of the SQL expression specified at the
// time the column was created in a CREATE TABLE, REPLACE TABLE, or ALTER TABLE
// ADD COLUMN command. For example, for "CREATE TABLE t (col INT DEFAULT 42)",
// this field is equal to the string literal "42" (without quotation marks).
private String sql;

// This is the literal value corresponding to the above SQL string. For the above
// example, this would be a literal integer with a value of 42.
private Literal value;
```
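The lazy back-fill described in the javadoc — the data source remembers the column's existence default and the reader substitutes it for rows written before the column was added — can be sketched with a toy model (the row representation and `read` helper are illustrative, not the DS v2 API):

```java
import java.util.HashMap;
import java.util.Map;

public class ExistenceDefaultSketch {
    // Per-column existence defaults remembered by the data source,
    // e.g. "col" -> 42 after ALTER TABLE t ADD COLUMN col INT DEFAULT 42.
    static final Map<String, Object> EXISTENCE_DEFAULTS = Map.<String, Object>of("col", 42);

    /**
     * Reads a field from a stored row. Rows written before the column was
     * added have no entry for it; instead of surfacing null, the reader
     * fills in the remembered default value on the fly.
     */
    static Object read(Map<String, ?> storedRow, String column) {
        if (storedRow.containsKey(column)) {
            return storedRow.get(column);
        }
        return EXISTENCE_DEFAULTS.get(column); // lazy back-fill at read time
    }

    public static void main(String[] args) {
        Map<String, Object> oldRow = new HashMap<>();  // written before ADD COLUMN
        Map<String, Object> newRow = Map.of("col", 7); // written after
        System.out.println(read(oldRow, "col")); // 42
        System.out.println(read(newRow, "col")); // 7
    }
}
```

This is the "existence default" half of the contract; the "current default" half is handled entirely by the analyzer at DML time and never reaches the data source.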
[GitHub] [spark] amaliujia commented on a diff in pull request #40054: [SPARK-42477] [CONNECT] [PYTHON]: accept user_agent in spark connect's connection string
amaliujia commented on code in PR #40054: URL: https://github.com/apache/spark/pull/40054#discussion_r1110286421 ## connector/connect/docs/client-connection-string.md: ## @@ -81,6 +82,15 @@ sc://hostname:port/;param1=value;param2=value user_id=Martin + +user_agent +String +The user agent acting on behalf of the user, typically applications +that use Spark Connect to implement its functionality and execute Spark +requests on behalf of the user. +Default: _SPARK_CONNECT_PYTHON in the Python client Review Comment: I see that this is changed to default value in Python client. This is fine. Probably in the future we just list the default value for the Scala client.
[GitHub] [spark] amaliujia commented on a diff in pull request #40054: [SPARK-42477] [CONNECT] [PYTHON]: accept user_agent in spark connect's connection string
amaliujia commented on code in PR #40054: URL: https://github.com/apache/spark/pull/40054#discussion_r1110284777 ## connector/connect/docs/client-connection-string.md: ## @@ -81,6 +82,15 @@ sc://hostname:port/;param1=value;param2=value user_id=Martin + +user_agent +String +The user agent acting on behalf of the user, typically applications +that use Spark Connect to implement its functionality and execute Spark +requests on behalf of the user. +Default: _SPARK_CONNECT_PYTHON in the Python client Review Comment: Wait, I thought this was addressed: the default value is empty, right? `_SPARK_CONNECT_PYTHON` is the value set by the Python client, and the Scala client, for example, will set a different value.
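The connection string under discussion carries its parameters as `;`-separated `key=value` pairs after the path, e.g. `sc://hostname:port/;user_id=Martin;user_agent=...`. A minimal sketch of extracting those parameters follows; the parsing is purely illustrative and is not the actual Spark Connect client implementation.

```python
# Illustrative parser for a Spark Connect-style connection string; not the
# real client code. Everything after "/;" is a ";"-separated list of
# key=value pairs such as user_id and user_agent from the doc above.
from typing import Dict

def parse_params(url: str) -> Dict[str, str]:
    # Split off the scheme, host:port, and path; keep the parameter list.
    _, _, rest = url.partition("/;")
    params: Dict[str, str] = {}
    for pair in filter(None, rest.split(";")):
        key, _, value = pair.partition("=")
        params[key] = value
    return params

params = parse_params("sc://hostname:15002/;user_id=Martin;user_agent=my_app")
print(params["user_agent"])  # my_app
```

With no parameter list present, the sketch simply returns an empty dict, which matches the "default is empty unless the client sets it" behavior being discussed.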
[GitHub] [spark] LuciferYang commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
LuciferYang commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110257681 ## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala: ## @@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField, import org.apache.spark.unsafe.types.UTF8String private[sql] object PartitioningUtils { + private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r Review Comment: So if the data corresponding to the partition `a=1` is stored in dir `/1/`, will there be a bad case with this PR?
[GitHub] [spark] LuciferYang commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
LuciferYang commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110243153 ## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala: ## @@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField, import org.apache.spark.unsafe.types.UTF8String private[sql] object PartitioningUtils { + private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r Review Comment: Too idealistic: not all partition tables follow this rule. For example, we can use `alter table ... partition(...) set location ...` to relocate the partition to any directory.
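The concern in this review thread can be demonstrated: a `(.+)=(.+)` pattern recovers a key/value pair only from path segments that follow the `key=value` layout, and recovers nothing from a relocated partition directory like `/1/`. A small sketch, using Python's `re` in place of Scala's `"(.+)=(.+)".r` purely for illustration:

```python
# Sketch of the key=value partition-path assumption discussed above,
# using Python's re as a stand-in for Scala's "(.+)=(.+)".r.
import re

PATTERN_FOR_KEY_EQ_VAL = re.compile(r"(.+)=(.+)")

def parse_segment(segment: str):
    """Return (key, value) if the path segment looks like key=value,
    else None (e.g. a partition relocated via SET LOCATION)."""
    m = PATTERN_FOR_KEY_EQ_VAL.fullmatch(segment)
    return (m.group(1), m.group(2)) if m else None

print(parse_segment("a=1"))  # ('a', '1')
print(parse_segment("1"))    # None: a relocated dir like /1/ carries no key
```

This is exactly the bad case the reviewers raise: once a partition location no longer encodes the key, nothing in the path can be matched back to the partition spec.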
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435129166 Yeah, Spark 3.4.0 does not need this PR.
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435116865 I'm trying to assess the issue. So, that combination issue is not an AS-IS Apache Spark issue in either master or branch-3.4, right? FYI, the CycloneDX plugin 2.7.4 issue is a known one. When I started the SBOM work, 2.7.4 was the latest but was unusable across multiple ASF projects. That was the main reason I chose 2.7.3 instead of the latest at that time. I'm not quite sure if 2.7.5 is stable enough. Anyway, we can apply this PR on the `master` branch for Apache Spark 3.5.0 only, separately from the Maven issue. Maven is always another big issue.
[GitHub] [spark] rehevkor5 commented on a diff in pull request #29591: [SPARK-32714][PYTHON] Initial pyspark-stubs port.
rehevkor5 commented on code in PR #29591: URL: https://github.com/apache/spark/pull/29591#discussion_r1110230025 ## python/pyspark/sql/dataframe.pyi: ## @@ -0,0 +1,324 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from typing import overload +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Tuple, +Union, +) + +from py4j.java_gateway import JavaObject # type: ignore[import] + +from pyspark.sql._typing import ColumnOrName, LiteralType, OptionalPrimitiveType +from pyspark.sql.types import ( # noqa: F401 +StructType, +StructField, +StringType, +IntegerType, +Row, +) # noqa: F401 +from pyspark.sql.context import SQLContext +from pyspark.sql.group import GroupedData +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.column import Column +from pyspark.rdd import RDD +from pyspark.storagelevel import StorageLevel + +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +class DataFrame(PandasMapOpsMixin, PandasConversionMixin): +sql_ctx: SQLContext +is_cached: bool +def __init__(self, jdf: JavaObject, sql_ctx: SQLContext) -> None: ... 
+@property +def rdd(self) -> RDD[Row]: ... +@property +def na(self) -> DataFrameNaFunctions: ... +@property +def stat(self) -> DataFrameStatFunctions: ... +def toJSON(self, use_unicode: bool = ...) -> RDD[str]: ... +def registerTempTable(self, name: str) -> None: ... +def createTempView(self, name: str) -> None: ... +def createOrReplaceTempView(self, name: str) -> None: ... +def createGlobalTempView(self, name: str) -> None: ... +@property +def write(self) -> DataFrameWriter: ... +@property +def writeStream(self) -> DataStreamWriter: ... +@property +def schema(self) -> StructType: ... +def printSchema(self) -> None: ... +def explain( +self, extended: Optional[Union[bool, str]] = ..., mode: Optional[str] = ... +) -> None: ... +def exceptAll(self, other: DataFrame) -> DataFrame: ... +def isLocal(self) -> bool: ... +@property +def isStreaming(self) -> bool: ... +def show( +self, n: int = ..., truncate: Union[bool, int] = ..., vertical: bool = ... +) -> None: ... +def checkpoint(self, eager: bool = ...) -> DataFrame: ... +def localCheckpoint(self, eager: bool = ...) -> DataFrame: ... +def withWatermark( +self, eventTime: ColumnOrName, delayThreshold: str +) -> DataFrame: ... +def hint(self, name: str, *parameters: Any) -> DataFrame: ... +def count(self) -> int: ... +def collect(self) -> List[Row]: ... +def toLocalIterator(self, prefetchPartitions: bool = ...) -> Iterator[Row]: ... +def limit(self, num: int) -> DataFrame: ... +def take(self, num: int) -> List[Row]: ... +def tail(self, num: int) -> List[Row]: ... +def foreach(self, f: Callable[[Row], None]) -> None: ... +def foreachPartition(self, f: Callable[[Iterator[Row]], None]) -> None: ... Review Comment: Shouldn't this be `Iterable[Row]` instead of `Iterator[Row]`, to match https://github.com/apache/spark/pull/29591/files#diff-6349afe05d41878cc15995c96a14b011d6aef04b779e136f711eab989b71da6cR215 ?
[GitHub] [spark] rehevkor5 commented on a diff in pull request #34225: [SPARK-36885][PYTHON] Inline type hints for pyspark.sql.dataframe
rehevkor5 commented on code in PR #34225: URL: https://github.com/apache/spark/pull/34225#discussion_r1110226166 ## python/pyspark/sql/dataframe.py: ## @@ -784,7 +824,7 @@ def foreach(self, f): """ self.rdd.foreach(f) -def foreachPartition(self, f): +def foreachPartition(self, f: Callable[[Iterator[Row]], None]) -> None: Review Comment: Shouldn't this be `Iterable[Row]` instead of `Iterator[Row]`, so that it matches https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1750 ?
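The distinction this thread keeps raising can be demonstrated outside Spark: every iterator is also an iterable, so a hint of `Iterable[Row]` accepts strictly more callables than `Iterator[Row]`, while the object actually handed to the callback behaves like a one-pass iterator either way. A minimal sketch with a stand-in `Row` and a stand-in for `foreachPartition` (not Spark's implementation):

```python
# Sketch of the Iterator-vs-Iterable question above, without Spark.
# foreach_partition is a stand-in for DataFrame.foreachPartition: it hands
# each partition's rows to f as a lazy, one-pass iterator.
from typing import Callable, Iterator, List

Row = dict  # stand-in for pyspark.sql.Row

def foreach_partition(partitions: List[List[Row]],
                      f: Callable[[Iterator[Row]], None]) -> None:
    for part in partitions:
        f(iter(part))  # each partition is passed as an iterator

seen: List[Row] = []

def collect(rows: Iterator[Row]) -> None:
    # Works under either hint; the rows can only be consumed once.
    seen.extend(rows)

foreach_partition([[{"id": 1}], [{"id": 2}]], collect)
print(seen)  # [{'id': 1}, {'id': 2}]
```

Whether the public hint should say `Iterator` or `Iterable` is therefore a contract question (what callers may assume about re-iteration), not a runtime one.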
[GitHub] [spark] gengliangwang commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
gengliangwang commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110209562 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.catalog; + +import java.util.Objects; +import javax.annotation.Nonnull; + +import org.apache.spark.sql.connector.expressions.Literal; + +/** + * A class representing the default value of a column. It contains both the SQL string and literal + * value of the user-specified default value expression. The SQL string should be re-evaluated for + * each table writing command, which may produce different values if the default value expression is + * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if + * new columns with default value are added. Note: the back-fill can be lazy. The data sources can + * remember the column default value and let the reader fill the column value when reading existing + * data that do not have these new columns. 
+ */ +public class ColumnDefaultValue { + private String sql; + private Literal value; Review Comment: Actually, I have read the classdoc before commenting... I don't have a better suggestion. Let's enhance the doc later.
[GitHub] [spark] wecharyu opened a new pull request, #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
wecharyu opened a new pull request, #40069: URL: https://github.com/apache/spark/pull/40069

### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects when dropping partitions.

### Why are the changes needed?
1. Partition names are enough to drop partitions.
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing the existing tests.
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435050161 Please let me explain my intention more:

1. First, I wanted to update Maven to 3.9.0 (keeping CycloneDX 2.7.3), but I found the following error:
```
[ERROR] An error occurred attempting to read POM org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen
```
We can check https://github.com/LuciferYang/spark/actions/runs/4206035140/jobs/7299042843 later.

2. Then I tested Maven 3.9.0 + CycloneDX 2.7.4 a couple of days ago, but hit the same error as with `maven 3.8.7 + cyclonedx-maven-plugin 2.7.4`. I think we can see it here later: https://github.com/LuciferYang/spark/runs/11424487074

3. Then I tested Maven 3.9.0 + CycloneDX 2.7.5 today, and there were none of the above issues (we can check https://github.com/LuciferYang/spark/runs/11424568023 later).

If I want to upgrade Spark to use Maven 3.9.0, I must upgrade cyclonedx-maven-plugin to 2.7.5, so should I upgrade them in one PR at the same time?
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435045931 Yes, we use CycloneDX 2.7.3. So I should not explain that 2.7.4 has such an issue in the PR description, because it does not affect Spark now, am I right?
[GitHub] [spark] cloud-fan commented on a diff in pull request #39942: [SPARK-42398][SQL] Refine default column value framework
cloud-fan commented on code in PR #39942: URL: https://github.com/apache/spark/pull/39942#discussion_r1110153301 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3108,7 +3108,7 @@ object SQLConf { "provided values when the corresponding fields are not present in storage.") .version("3.4.0") .stringConf - .createWithDefault("csv,json,orc,parquet") + .createWithDefault("csv,json,orc,parquet,hive") Review Comment: ```suggestion .createWithDefault("csv,json,orc,parquet") ```
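The config being debated here is a comma-separated allowlist of providers that honor default column values, so deciding whether a given source participates amounts to a membership test. A hedged sketch of that check (the function name is illustrative, not Spark's internal helper):

```python
# Illustrative check against a comma-separated provider allowlist like the
# SQLConf default above; not Spark's actual helper.
def default_columns_enabled(provider: str, allowlist: str) -> bool:
    """Return True if provider appears in the comma-separated allowlist,
    matching case-insensitively and ignoring surrounding whitespace."""
    providers = {p.strip().lower() for p in allowlist.split(",") if p.strip()}
    return provider.lower() in providers

conf = "csv,json,orc,parquet"
print(default_columns_enabled("Parquet", conf))  # True
print(default_columns_enabled("hive", conf))     # False
```

Under the suggested revert, `hive` would fall outside the allowlist, which is exactly what the second output line shows.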
[GitHub] [spark] cloud-fan commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
cloud-fan commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110150524 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.catalog; + +import java.util.Objects; +import javax.annotation.Nonnull; + +import org.apache.spark.sql.connector.expressions.Literal; + +/** + * A class representing the default value of a column. It contains both the SQL string and literal + * value of the user-specified default value expression. The SQL string should be re-evaluated for + * each table writing command, which may produce different values if the default value expression is + * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if + * new columns with default value are added. Note: the back-fill can be lazy. The data sources can + * remember the column default value and let the reader fill the column value when reading existing + * data that do not have these new columns. 
+ */ +public class ColumnDefaultValue { + private String sql; + private Literal value; Review Comment: Can you also read the classdoc? If you still think the name is confusing, let's figure out a better one.
[GitHub] [spark] cloud-fan commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
cloud-fan commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110146649 ## sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala: ## @@ -64,6 +64,15 @@ private[connector] trait TestV2SessionCatalogBase[T <: Table] extends Delegating } } + override def createTable( + ident: Identifier, + columns: Array[Column], + partitions: Array[Transform], + properties: java.util.Map[String, String]): Table = { +createTable(ident, CatalogV2Util.v2ColumnsToStructType(columns), partitions, properties) + } + + // TODO: remove it when no tests calling this deprecated method. Review Comment: I haven't created it. This is a test-only change so the priority is low.
[GitHub] [spark] cloud-fan commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
cloud-fan commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110135070 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.catalog; + +import java.util.Objects; +import javax.annotation.Nonnull; + +import org.apache.spark.sql.connector.expressions.Literal; + +/** + * A class representing the default value of a column. It contains both the SQL string and literal + * value of the user-specified default value expression. The SQL string should be re-evaluated for + * each table writing command, which may produce different values if the default value expression is + * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if + * new columns with default value are added. Note: the back-fill can be lazy. The data sources can + * remember the column default value and let the reader fill the column value when reading existing + * data that do not have these new columns. 
+ */ +public class ColumnDefaultValue { + private String sql; + private Literal value; Review Comment: A default value has two parts: the SQL string and the evaluated literal value. I don't think "current default" and "exist default" are easier to understand for data source developers.
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435007566 I mean in our GitHub Action repo. We are using CycloneDX 2.7.3, aren't we? > I make another one build with maven 3.8.7 + cyclonedx-maven-plugin 2.7.4 https://github.com/LuciferYang/spark/actions/runs/4205904014/jobs/7298678641