[GitHub] [spark] MaxGekk closed pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
MaxGekk closed pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types URL: https://github.com/apache/spark/pull/40074 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] MaxGekk commented on pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
MaxGekk commented on PR #40074: URL: https://github.com/apache/spark/pull/40074#issuecomment-1435512116 +1, LGTM. Merging to master/3.4. Thank you, @gengliangwang.
[GitHub] [spark] MaxGekk commented on pull request #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
MaxGekk commented on PR #40074: URL: https://github.com/apache/spark/pull/40074#issuecomment-1435511800 The Kubernetes test failure is not related to this PR, I believe:

```
[info] org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED *** (27 minutes, 35 seconds)
[info]   io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.49.2:8443/api/v1/namespaces. Message: object is being deleted: namespaces "spark-e7de0ffd81044f09afb2693a0e227a43" already exists. Received status:
```
[GitHub] [spark] MaxGekk closed pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE
MaxGekk closed pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE URL: https://github.com/apache/spark/pull/39910
[GitHub] [spark] MaxGekk commented on pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE
MaxGekk commented on PR #39910: URL: https://github.com/apache/spark/pull/39910#issuecomment-1435510023 +1, LGTM. Merging to master/3.4. Thank you, @allisonwang-db.
[GitHub] [spark] MaxGekk commented on pull request #39910: [SPARK-42337][SQL] Add error class INVALID_TEMP_OBJ_REFERENCE
MaxGekk commented on PR #39910: URL: https://github.com/apache/spark/pull/39910#issuecomment-1435509673 The Kubernetes test failure is not related to this PR, I think:

```
[info] - SPARK-38187: Run SparkPi Jobs with minCPU *** FAILED *** (3 minutes, 1 second)
[info]   The code passed to eventually never returned normally. Attempted 189 times over 3.0106043718 minutes. Last failure message: 0 did not equal 2. (VolcanoTestsSuite.scala:302)
```
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1110773664

## core/src/main/scala/org/apache/spark/storage/BlockManager.scala:

```diff
@@ -1325,31 +1328,71 @@ private[spark] class BlockManager(
     blockInfoManager.releaseAllLocksForTask(taskAttemptId)
   }
 
+  /**
+   * Retrieve the given rdd block if it exists and is visible, otherwise call the provided
+   * `makeIterator` method to compute the block, persist it, and return its values.
+   *
+   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
+   *         could not be cached.
+   */
+  def getOrElseUpdateRDDBlock[T](
+      taskId: Long,
+      blockId: RDDBlockId,
+      level: StorageLevel,
+      classTag: ClassTag[T],
+      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
+    val isCacheVisible = isRDDBlockVisible(blockId)
+    val res = getOrElseUpdate(blockId, level, classTag, makeIterator, isCacheVisible)
+    if (res.isLeft && !isCacheVisible) {
+      // Block exists but not visible, report taskId -> blockId info to master.
+      master.updateRDDBlockTaskInfo(blockId, taskId)
+    }
+
+    res
+  }
+
   /**
    * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
    * to compute the block, persist it, and return its values.
    *
    * @return either a BlockResult if the block was successfully cached, or an iterator if the block
    *         could not be cached.
    */
-  def getOrElseUpdate[T](
+  private def getOrElseUpdate[T](
       blockId: BlockId,
       level: StorageLevel,
       classTag: ClassTag[T],
-      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
-    // Attempt to read the block from local or remote storage. If it's present, then we don't need
-    // to go through the local-get-or-put path.
-    get[T](blockId)(classTag) match {
-      case Some(block) =>
-        return Left(block)
-      case _ =>
-        // Need to compute the block.
+      makeIterator: () => Iterator[T],
+      isCacheVisible: Boolean = true): Either[BlockResult, Iterator[T]] = {
+    // Track whether the data is computed or not, force to do the computation later if need to.
+    // The reason we push the force computing later is that once the executor is decommissioned we
+    // will have a better chance to replicate the cache block because of the `checkShouldStore`
+    // validation when putting a new block.
+    var computed: Boolean = false
+    val iterator = () => {
+      computed = true
+      makeIterator()
+    }
+    if (isCacheVisible) {
+      // Attempt to read the block from local or remote storage. If it's present, then we don't need
+      // to go through the local-get-or-put path.
+      get[T](blockId)(classTag) match {
+        case Some(block) =>
+          return Left(block)
+        case _ =>
+          // Need to compute the block.
+      }
     }
+
     // Initially we hold no locks on this block.
-    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
+    doPutIterator(blockId, iterator, level, classTag, keepReadLock = true) match {
       case None =>
         // doPut() didn't hand work back to us, so the block already existed or was successfully
         // stored. Therefore, we now hold a read lock on the block.
+        if (!isCacheVisible && !computed) {
+          // Force compute to report accumulator updates.
```

Review Comment: Yes, the recomputation is only for updating accumulators. The result should be the same unless the result is indeterminate.
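The control flow under review can be sketched outside Spark. The following is a hedged Python illustration (the function name, the dict-backed cache, and the iterator protocol are stand-ins, not Spark's API) of the pattern in the diff: wrap the compute function so the caller can tell whether it actually ran, and, if the block was already cached but is not yet visible, force one recomputation purely so side effects such as accumulator updates are still reported.

```python
def get_or_else_update(cache, block_id, make_iterator, is_cache_visible):
    """Sketch of the cache-visibility handling discussed in the review.

    `cache` is a plain dict standing in for block storage; `make_iterator`
    stands in for the task's computation, whose side effects (accumulator
    updates) must happen at least once per task attempt.
    """
    computed = False

    def iterator():
        # Record that the computation actually ran.
        nonlocal computed
        computed = True
        return make_iterator()

    # Fast path only when the cached block is visible.
    if is_cache_visible and block_id in cache:
        return cache[block_id]

    # Compute and persist if the block does not exist yet.
    if block_id not in cache:
        cache[block_id] = list(iterator())

    # The block existed but was not visible and we never ran the
    # computation: force it just to report accumulator updates.
    if not is_cache_visible and not computed:
        for _ in make_iterator():
            pass

    return cache[block_id]
```

As the review reply notes, the forced recomputation should produce the same values as the cached block unless the computation is indeterminate; its only purpose here is the side effects.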
[GitHub] [spark] dongjoon-hyun closed pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun closed pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results URL: https://github.com/apache/spark/pull/40072
[GitHub] [spark] dongjoon-hyun commented on pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on PR #40072: URL: https://github.com/apache/spark/pull/40072#issuecomment-1435501963 Thank you so much, as always, for your help, @viirya! Merged to master for Apache Spark 3.5.
[GitHub] [spark] viirya commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
viirya commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110703022

## sql/core/benchmarks/DataSourceReadBenchmark-results.txt:

```diff
@@ -2,430 +2,430 @@ SQL Single Numeric Column Scan
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single BOOLEAN Column Scan:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 --------------------------------------------------------------------------------------------------------------
-SQL CSV                                      10433         10554        172        1.5        663.3      1.0X
-SQL Json                                      7948          7990         60        2.0        505.3      1.3X
-SQL Parquet Vectorized: DataPageV1             126           149         22      125.2          8.0     83.0X
-SQL Parquet Vectorized: DataPageV2              99           113         17      158.6          6.3    105.2X
-SQL Parquet MR: DataPageV1                    1777          1784          9        8.8        113.0      5.9X
-SQL Parquet MR: DataPageV2                    1579          1583          6       10.0        100.4      6.6X
-SQL ORC Vectorized                             158           165          5       99.7         10.0     66.1X
-SQL ORC MR                                    1654          1661          9        9.5        105.2      6.3X
-
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+SQL CSV                                      13143         13363        311        1.2        835.6      1.0X
```

Review Comment: Hmm, it's significant.
[GitHub] [spark] dongjoon-hyun commented on pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module
dongjoon-hyun commented on PR #40053: URL: https://github.com/apache/spark/pull/40053#issuecomment-1435487106 Thank you, @LuciferYang and @huaxingao!
[GitHub] [spark] dongjoon-hyun commented on pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on PR #40072: URL: https://github.com/apache/spark/pull/40072#issuecomment-1435486887 When you have some time, could you review this, @viirya? I want to merge this to proceed with the further investigations.
[GitHub] [spark] huaxingao commented on pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module
huaxingao commented on PR #40053: URL: https://github.com/apache/spark/pull/40053#issuecomment-1435483886 Merged to master. Thanks @LuciferYang
[GitHub] [spark] huaxingao closed pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module
huaxingao closed pull request #40053: [SPARK-42470][SQL] Remove unused declarations from Hive module URL: https://github.com/apache/spark/pull/40053
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435482433 Got it. Thank you for the information.
[GitHub] [spark] LuciferYang commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
LuciferYang commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110563308

## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala:

```diff
@@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField,
 import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object PartitioningUtils {
+  private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r
```

Review Comment: I remember seeing similar cases in the production environment, but I can't remember the details. We need tests covering the corner cases we can think of. cc @wangyum @sunchao FYI
[GitHub] [spark] ozhembr opened a new pull request, #40077: [SPIP][POC] Driver scaling: parallel schedulers
ozhembr opened a new pull request, #40077: URL: https://github.com/apache/spark/pull/40077

### What changes were proposed in this pull request?
SPIP: https://docs.google.com/document/d/1_MVEpGxz6U_CNqKArR1M1l2oP-3I7O67grfwPtniLaA/edit?usp=sharing POC of scaling the Spark driver via parallel schedulers. Uses multiple groups of `CoarseGrainedSchedulerBackend` and `TaskSchedulerImpl`.

### Why are the changes needed?
Low performance of the Spark driver with multiple large jobs.

### Does this PR introduce _any_ user-facing change?
Config for enabling parallel schedulers: `spark.driver.schedulers.parallelism` - the number of parallel schedulers; leaving it unset or setting it to <= 1 disables parallelism.

### How was this patch tested?
Comparison tests with spark-sql processes at the same parallelism level.
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435477929 @dongjoon-hyun found a new issue related to 2.7.5: https://github.com/CycloneDX/cyclonedx-maven-plugin/issues/284
[GitHub] [spark] wecharyu commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
wecharyu commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110559214

## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala:

```diff
@@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField,
 import org.apache.spark.unsafe.types.UTF8String
 
 private[sql] object PartitioningUtils {
+  private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r
```

Review Comment: @LuciferYang Thanks for your review. The partition name always follows this rule in Hive's [makePartName](https://github.com/apache/hive/blob/1747bc7b8c58d84f455c95f26f1183a65246f331/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java#L571). The partition name depends only on the partition keys and values; other partition fields such as the location do not affect it.
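The rule under discussion can be checked with a quick sketch. The following is a hedged, illustrative example (not Spark's implementation) of matching Hive-style partition names of the form `key1=val1/key2=val2` against the `(.+)=(.+)` pattern from the review, including one of the corner cases the thread asks to be tested: a segment that contains more than one `=`.

```python
import re

# The pattern from the review thread; note both groups are greedy.
PATTERN_FOR_KEY_EQ_VAL = re.compile(r"(.+)=(.+)")

def parse_partition_name(name):
    """Split a Hive-style partition name into (key, value) pairs."""
    pairs = []
    for segment in name.split("/"):
        m = PATTERN_FOR_KEY_EQ_VAL.match(segment)
        if m is None:
            raise ValueError(f"not a key=value segment: {segment!r}")
        pairs.append((m.group(1), m.group(2)))
    return pairs
```

Because the first group is greedy, a segment like `k=a=b` parses as key `k=a` and value `b`; whether that agrees with how Hive escapes special characters when building partition names is exactly the kind of corner case the review suggests covering with tests.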
[GitHub] [spark] jchen5 commented on a diff in pull request #39759: [SPARK-36124][SQL] Support subqueries with correlation through INTERSECT/EXCEPT
jchen5 commented on code in PR #39759: URL: https://github.com/apache/spark/pull/39759#discussion_r1110500495

## sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:

```diff
@@ -1120,6 +1258,105 @@ struct
 1	2	3
 
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    EXCEPT DISTINCT
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  UNION DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   INTERSECT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+0	1	1
+0	1	2
+0	1	3
+1	2	1
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    UNION ALL
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  INTERSECT DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   EXCEPT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct
+-- !query output
+0	1	0
+1	2	0
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4
+   WHERE t4.c1 > t1.c2) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED",
```

Review Comment: Yes, expected. This case is a semi join with correlation on the right side, which is unsupported. (It's noted in a comment in the query input file; is there a way to get those comments to show up in the output file too, for easier visibility?)
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110497257

## sql/core/benchmarks/DataSourceReadBenchmark-results.txt:

```diff
@@ -2,430 +2,430 @@ SQL Single Numeric Column Scan
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 SQL Single BOOLEAN Column Scan:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 --------------------------------------------------------------------------------------------------------------
-SQL CSV                                      10433         10554        172        1.5        663.3      1.0X
-SQL Json                                      7948          7990         60        2.0        505.3      1.3X
-SQL Parquet Vectorized: DataPageV1             126           149         22      125.2          8.0     83.0X
-SQL Parquet Vectorized: DataPageV2              99           113         17      158.6          6.3    105.2X
-SQL Parquet MR: DataPageV1                    1777          1784          9        8.8        113.0      5.9X
-SQL Parquet MR: DataPageV2                    1579          1583          6       10.0        100.4      6.6X
-SQL ORC Vectorized                             158           165          5       99.7         10.0     66.1X
-SQL ORC MR                                    1654          1661          9        9.5        105.2      6.3X
-
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+SQL CSV                                      13143         13363        311        1.2        835.6      1.0X
```

Review Comment: CSV seems to become 30% slower.
[GitHub] [spark] allisonwang-db commented on a diff in pull request #39759: [SPARK-36124][SQL] Support subqueries with correlation through INTERSECT/EXCEPT
allisonwang-db commented on code in PR #39759: URL: https://github.com/apache/spark/pull/39759#discussion_r1110473587

## sql/core/src/test/resources/sql-tests/results/join-lateral.sql.out:

```diff
@@ -1120,6 +1258,105 @@ struct
 1	2	3
 
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    EXCEPT DISTINCT
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  UNION DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   INTERSECT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+0	1	1
+0	1	2
+0	1	3
+1	2	1
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL
+  ((SELECT t2.c2
+    FROM t2
+    WHERE t2.c1 = t1.c1
+    UNION ALL
+    SELECT t4.c2
+    FROM t4
+    WHERE t4.c1 > t1.c2)
+  INTERSECT DISTINCT
+  (SELECT t4.c1
+   FROM t4
+   WHERE t4.c1 <= t1.c2
+   EXCEPT ALL
+   SELECT t4.c2
+   FROM t4
+   WHERE t4.c1 <> t1.c1)
+)
+-- !query schema
+struct
+-- !query output
+
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct
+-- !query output
+0	1	0
+1	2	0
+
+
+-- !query
+SELECT * FROM t1 JOIN LATERAL (SELECT sum(c1) FROM
+  (SELECT *
+   FROM t2
+   WHERE t2.c1 <= t1.c1) lhs
+  LEFT SEMI JOIN
+  (SELECT *
+   FROM t4
+   WHERE t4.c1 > t1.c2) rhs
+  ON lhs.c1 <=> rhs.c1 and lhs.c2 <=> rhs.c2
+)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.ACCESSING_OUTER_QUERY_COLUMN_IS_NOT_ALLOWED",
```

Review Comment: Is this expected?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110473324

## sql/core/benchmarks/SortBenchmark-jdk17-results.txt:

```diff
@@ -2,15 +2,15 @@ radix sort
-OpenJDK 64-Bit Server VM 17.0.5+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 17.0.6+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 radix sort 2500:                    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
 -------------------------------------------------------------------------------------------------------------
-reference TimSort key prefix array          12059         12071         16        2.1        482.4      1.0X
-reference Arrays.sort                        2864          2887         33        8.7        114.5      4.2X
-radix sort one byte                           197           203          8      126.8          7.9     61.1X
-radix sort two bytes                          373           375          2       66.9         14.9     32.3X
-radix sort eight bytes                       1415          1417          4       17.7         56.6      8.5X
-radix sort key prefix array                  1930          1966         51       13.0         77.2      6.2X
+reference TimSort key prefix array          12111         12128         23        2.1        484.4      1.0X
+reference Arrays.sort                        2861          2885         35        8.7        114.4      4.2X
+radix sort one byte                           197           197          0      127.0          7.9     61.5X
+radix sort two bytes                          371           372          0       67.4         14.8     32.6X
+radix sort eight bytes                       1391          1397          8       18.0         55.7      8.7X
+radix sort key prefix array                  1914          1951         52       13.1         76.6      6.3X
```

Review Comment: In this benchmark, all Java 17 results are faster than Java 8.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110471680

## sql/core/benchmarks/TPCDSQueryBenchmark-jdk11-results.txt: ##

```diff
@@ -1,810 +1,810 @@
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 TPCDS Snappy:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-q1                          1772          1905        188        0.3       3841.1      1.0X
+q1                          1888          2074        263        0.2       4092.0      1.0X

-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 TPCDS Snappy:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-q2                          1686          1696         15        1.3        755.2      1.0X
+q2                          1585          1899        444        1.4        710.1      1.0X

-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
 TPCDS Snappy:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-q3                           718           759         41        4.1        241.8      1.0X
+q3                           996          1035         55        3.0        335.3      1.0X
```

Review Comment: Maybe, slower?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110470216

## sql/core/benchmarks/UpdateFieldsBenchmark-results.txt: ##

```diff
@@ -2,25 +2,25 @@ Add 2 columns and drop 2 columns at 3 different depths of nesting
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Add 2 columns and drop 2 columns at 3 different depths of nesting:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-To non-nullable StructTypes using performant method              4     6     3  0.0  Infinity  1.0X
-To nullable StructTypes using performant method                  3     4     1  0.0  Infinity  1.3X
-To non-nullable StructTypes using non-performant method         54    63     5  0.0  Infinity  0.1X
-To nullable StructTypes using non-performant method           2002  2091   127  0.0  Infinity  0.0X
+To non-nullable StructTypes using performant method              6     8     3  0.0  Infinity  1.0X
+To nullable StructTypes using performant method                  4     5     2  0.0  Infinity  1.4X
+To non-nullable StructTypes using non-performant method         68    73     5  0.0  Infinity  0.1X
+To nullable StructTypes using non-performant method           2223  2452   324  0.0  Infinity  0.0X

 Add 50 columns and drop 50 columns at 100 different depths of nesting
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
 Add 50 columns and drop 50 columns at 100 different depths of nesting:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-To non-nullable StructTypes using performant method           5520  5639   168  0.0  Infinity  1.0X
-To nullable StructTypes using performant method               2657  2708    72  0.0  Infinity  2.1X
+To non-nullable StructTypes using performant method           3126  3150    34  0.0  Infinity  1.0X
+To nullable StructTypes using performant method               3136  4768  2309  0.0  Infinity  1.0X
```

Review Comment: This looks like a regression in Java 8. We need to take a look at this later.
[GitHub] [spark] zhenlineo commented on a diff in pull request #40061: [SPARK-42482][CONNECT] Scala Client Write API V1
zhenlineo commented on code in PR #40061: URL: https://github.com/apache/spark/pull/40061#discussion_r1110470096

## connector/connect/client/jvm/src/main/java/org/apache/spark/sql/SaveMode.java: ##

```diff
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql;
+
+import org.apache.spark.annotation.Stable;
+
+/**
+ * SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
+ *
+ * @since 1.3.0
+ */
+@Stable
+public enum SaveMode {
```

Review Comment: I checked the MiMaExcludes file and found that if we want to break spark-sql backward compatibility, we need a very good reason to do so. So I left this out of this PR.
[GitHub] [spark] zhenlineo commented on a diff in pull request #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
zhenlineo commented on code in PR #40075: URL: https://github.com/apache/spark/pull/40075#discussion_r1110468795

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala: ##

```diff
@@ -156,6 +156,19 @@ class ClientE2ETestSuite extends RemoteSparkSession {
     }
   }

+  test("write v2") {
+    try {
+      spark.range(3).writeTo("myTableV2").using("parquet").create()
```

Review Comment: This API feels wrong; a few errors I get with this API:

```
spark.range(3).writeTo("myTableV2").create()
-> Error: should specify source
```

```
spark.range(3).writeTo("myTableV2").using("parquet").createOrReplace()
-> Error: does not support create or select.
```

```
spark.range(3).writeTo("myTableV2").using("parquet").create()
spark.range(3).writeTo("myTableV2").append()
-> Error: Cannot write into v1 table
```
[GitHub] [spark] srowen commented on pull request #39967: [SPARK-42395][K8S]The code logic of the configmap max size validation lacks extra content
srowen commented on PR #39967: URL: https://github.com/apache/spark/pull/39967#issuecomment-1435430725

I don't know enough about K8S to review this. Seems harmless.
[GitHub] [spark] ninebigbig commented on pull request #39967: [SPARK-42395][K8S]The code logic of the configmap max size validation lacks extra content
ninebigbig commented on PR #39967: URL: https://github.com/apache/spark/pull/39967#issuecomment-1435430134

Could I take some of your time to help review this PR, please? @dongjoon-hyun @HyukjinKwon @srowen @LuciferYang
[GitHub] [spark] ueshin opened a new pull request, #40076: [SPARK-42048][PYTHON][CONNECT] Fix the alias name for numpy literals
ueshin opened a new pull request, #40076: URL: https://github.com/apache/spark/pull/40076

### What changes were proposed in this pull request?

Fixes the alias name for numpy literals. Also fixes `F.lit` in Spark Connect to support `np.bool_` objects.

### Why are the changes needed?

Currently the alias name for literals created from numpy scalars contains something like `CAST(` ... `AS )`, but it should be removed, returning only the value string, the same as literals from Python numbers.

### Does this PR introduce _any_ user-facing change?

The alias name will be changed.

### How was this patch tested?

Modified/enabled related tests.
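As background (not part of the PR itself), a value-only alias is possible because numpy scalars unwrap to ordinary Python values; a minimal sketch of that conversion, using only numpy:

```python
import numpy as np

# .item() converts any numpy scalar (including np.bool_) to the plain
# Python value, which is what a value-only alias string requires.
for v in (np.int64(7), np.float64(2.5), np.bool_(True)):
    print(type(v.item()).__name__, v.item())
```

How Spark Connect actually builds the alias string is internal to the PR; the sketch only shows that no `CAST` wrapper is inherent to numpy inputs.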
[GitHub] [spark] zhenlineo opened a new pull request, #40075: [WIP] [CONNECT] Scala Client DataFrameWriterV2
zhenlineo opened a new pull request, #40075: URL: https://github.com/apache/spark/pull/40075

### What changes were proposed in this pull request?

Adds DataFrameWriterV2.

### Why are the changes needed?

Implements Dataset#writeTo.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

E2E tests.

This is based on https://github.com/apache/spark/pull/40061
[GitHub] [spark] hvanhovell closed pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell closed pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum} URL: https://github.com/apache/spark/pull/40070
[GitHub] [spark] hvanhovell commented on pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on PR #40070: URL: https://github.com/apache/spark/pull/40070#issuecomment-1435428773

Merging.
[GitHub] [spark] gengliangwang opened a new pull request, #40074: [SPARK-42430][DOC][FOLLOW-UP] Revise the java doc for TimestampNTZ & ANSI interval types
gengliangwang opened a new pull request, #40074: URL: https://github.com/apache/spark/pull/40074

### What changes were proposed in this pull request?

As https://github.com/apache/spark/pull/40005#pullrequestreview-1299089504 pointed out, the java doc for data types recommends using the factory methods provided in org.apache.spark.sql.types.DataTypes. Since the ANSI interval types missed the `DataTypes` mention as well, this PR also revises their docs.

### Why are the changes needed?

Unify the data type docs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Local preview: https://user-images.githubusercontent.com/1097932/219821685-321c2fd1-6248-4930-9c61-eec68f0dcb50.png
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110464359

## sql/catalyst/benchmarks/HashBenchmark-jdk11-results.txt: ##

```diff
@@ -2,69 +2,69 @@ single ints
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
 Hash For single ints:      Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-interpreted version                 3763          3769          8      142.7          7.0      1.0X
-codegen version                     4658          4662          5      115.3          8.7      0.8X
-codegen version 64-bit              4706          4710          6      114.1          8.8      0.8X
-codegen HiveHash version            3998          3998          0      134.3          7.4      0.9X
+interpreted version                 4933          4935          2      108.8          9.2      1.0X
+codegen version                     5135          5141          9      104.6          9.6      1.0X
+codegen version 64-bit              5071          5079         10      105.9          9.4      1.0X
+codegen HiveHash version            4326          4326          0      124.1          8.1      1.1X
```

Review Comment: Now, this is the fastest.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110463604

## sql/catalyst/benchmarks/EnumTypeSetBenchmark-jdk11-results.txt: ##

```diff
@@ -1,105 +1,105 @@
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 Test contains use empty Set:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Use HashSet                               4             4          0      226.9          4.4      1.0X
-Use EnumSet                               1             1          0      737.3          1.4      3.2X
+Use HashSet                               0             1          0     2440.2          0.4      1.0X
+Use EnumSet                               1             1          0      884.8          1.1      0.4X
```

Review Comment: `HashSet` seems to get some improvements in this case, `contains use empty Set:`. The other cases look to be in a reasonable range.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110463981

## sql/catalyst/benchmarks/EnumTypeSetBenchmark-results.txt: ##

```diff
@@ -1,105 +1,105 @@
-OpenJDK 64-Bit Server VM 1.8.0_352-b08 on Linux 5.15.0-1023-azure
+OpenJDK 64-Bit Server VM 1.8.0_362-b09 on Linux 5.15.0-1031-azure
 Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 Test contains use empty Set:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Use HashSet                               5             5          0      209.4          4.8      1.0X
-Use EnumSet                               2             2          0      459.8          2.2      2.2X
+Use HashSet                               1             1          1     1972.0          0.5      1.0X
+Use EnumSet                               2             2          0      444.0          2.3      0.2X
```

Review Comment: ditto.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110461118

## sql/catalyst/benchmarks/EnumTypeSetBenchmark-jdk11-results.txt: ##

```diff
@@ -1,105 +1,105 @@
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
 Test contains use empty Set:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Use HashSet                               4             4          0      226.9          4.4      1.0X
-Use EnumSet                               1             1          0      737.3          1.4      3.2X
+Use HashSet                               0             1          0     2440.2          0.4      1.0X
+Use EnumSet                               1             1          0      884.8          1.1      0.4X
```

Review Comment: We need to investigate this reversed ratio.
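For reference, the reversed ratio follows directly from how the Relative column is computed: it is the baseline time divided by the candidate time, with the first row serving as the 1.0X baseline. A quick sketch using the per-row times (ns) from the new jdk11 run:

```python
# Relative = baseline_time / candidate_time; the first row is the baseline.
per_row_ns = {"Use HashSet": 0.4, "Use EnumSet": 1.1}  # new jdk11 run
baseline = per_row_ns["Use HashSet"]
for name, t in per_row_ns.items():
    print(f"{name}: {baseline / t:.1f}X")
```

So once `HashSet` became faster than `EnumSet`, `EnumSet` necessarily dropped below 1.0X even though its own absolute time barely changed.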
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110460334

## core/benchmarks/ZStandardBenchmark-jdk11-results.txt: ##

```diff
@@ -2,26 +2,26 @@ Benchmark ZStandardCompressionCodec
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
 Benchmark ZStandardCompressionCodec:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Compression 1 times at level 1 without buffer pool    859   872   21  0.0   85890.3  1.0X
-Compression 1 times at level 2 without buffer pool    930   932    2  0.0   92995.6  0.9X
-Compression 1 times at level 3 without buffer pool   1137  1138    2  0.0  113664.6  0.8X
-Compression 1 times at level 1 with buffer pool       662   664    1  0.0   66244.7  1.3X
-Compression 1 times at level 2 with buffer pool       725   726    1  0.0   72541.4  1.2X
-Compression 1 times at level 3 with buffer pool       929   930    2  0.0   92851.4  0.9X
+Compression 1 times at level 1 without buffer pool    605   812  220  0.0   60521.0  1.0X
+Compression 1 times at level 2 without buffer pool    665   678   20  0.0   66512.5  0.9X
+Compression 1 times at level 3 without buffer pool    890   903   20  0.0   88961.3  0.7X
+Compression 1 times at level 1 with buffer pool       829   839   11  0.0   82940.2  0.7X
```

Review Comment: Java 8/17 doesn't have this regression.
[GitHub] [spark] github-actions[bot] commented on pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
github-actions[bot] commented on PR #38263: URL: https://github.com/apache/spark/pull/38263#issuecomment-1435417625

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #38505: [SPARK-40622][WIP]do not merge(try to fix build error)
github-actions[bot] commented on PR #38505: URL: https://github.com/apache/spark/pull/38505#issuecomment-1435417586

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #37362: [SPARK-39950][SQL] It's unnecessary to materialize BroadcastQueryStage firstly, because the BroadcastQueryStage does not timeout in AQE
github-actions[bot] closed pull request #37362: [SPARK-39950][SQL] It's unnecessary to materialize BroadcastQueryStage firstly, because the BroadcastQueryStage does not timeout in AQE. URL: https://github.com/apache/spark/pull/37362
[GitHub] [spark] github-actions[bot] commented on pull request #38434: [SPARK-40946][SQL] Add a new DataSource V2 interface SupportsPushDownClusterKeys
github-actions[bot] commented on PR #38434: URL: https://github.com/apache/spark/pull/38434#issuecomment-1435417608

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun commented on code in PR #40072: URL: https://github.com/apache/spark/pull/40072#discussion_r1110460125

## core/benchmarks/ZStandardBenchmark-jdk11-results.txt: ##

```diff
@@ -2,26 +2,26 @@ Benchmark ZStandardCompressionCodec
-OpenJDK 64-Bit Server VM 11.0.17+8 on Linux 5.15.0-1023-azure
-Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
+OpenJDK 64-Bit Server VM 11.0.18+10 on Linux 5.15.0-1031-azure
+Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
 Benchmark ZStandardCompressionCodec:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-Compression 1 times at level 1 without buffer pool    859   872   21  0.0   85890.3  1.0X
-Compression 1 times at level 2 without buffer pool    930   932    2  0.0   92995.6  0.9X
-Compression 1 times at level 3 without buffer pool   1137  1138    2  0.0  113664.6  0.8X
-Compression 1 times at level 1 with buffer pool       662   664    1  0.0   66244.7  1.3X
-Compression 1 times at level 2 with buffer pool       725   726    1  0.0   72541.4  1.2X
-Compression 1 times at level 3 with buffer pool       929   930    2  0.0   92851.4  0.9X
+Compression 1 times at level 1 without buffer pool    605   812  220  0.0   60521.0  1.0X
+Compression 1 times at level 2 without buffer pool    665   678   20  0.0   66512.5  0.9X
+Compression 1 times at level 3 without buffer pool    890   903   20  0.0   88961.3  0.7X
+Compression 1 times at level 1 with buffer pool       829   839   11  0.0   82940.2  0.7X
```

Review Comment: I'll take a look at this after this PR.
[GitHub] [spark] WweiL opened a new pull request, #40073: [SPARK-42484] UnsafeRowUtils better error message
WweiL opened a new pull request, #40073: URL: https://github.com/apache/spark/pull/40073

### What changes were proposed in this pull request?
Show the essential information when throwing `InvalidUnsafeRowException`, including where the check failed and the status of the `unsafeRow` and the `expectedSchema`. Example output:
```
[UnsafeRowStatus] expectedSchema: StructType(StructField(key1,IntegerType,false),StructField(key2,IntegerType,false),StructField(sum(key1),IntegerType,false),StructField(sum(key2),IntegerType,false)), expectedSchemaNumFields: 4, numFields: 4, bitSetWidthInBytes: 8, rowSizeInBytes: 40
fieldStatus:
[UnsafeRowFieldStatus] index: 0, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
[UnsafeRowFieldStatus] index: 1, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
[UnsafeRowFieldStatus] index: 2, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
[UnsafeRowFieldStatus] index: 3, expectedFieldType: IntegerType, isNull: false, isFixedLength: true, offset: -1, size: -1
```

### Why are the changes needed?
Right now, when such an error happens it is hard to track down where the check failed and what the misbehaving row and schema look like. With this change that information is explicit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests
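The error message in the PR description is a flat, key-value rendering of the row and per-field state. A minimal standalone sketch of how such a status string can be assembled is below; the class and field names mirror the example output, but this implementation is hypothetical and is not Spark's actual `UnsafeRowUtils` code:

```scala
// Hypothetical sketch: assemble a structured status string like the one in
// the PR's example output. Names mirror the example; the code is not Spark's.
case class FieldStatus(index: Int, expectedFieldType: String, isNull: Boolean,
                       isFixedLength: Boolean, offset: Int, size: Int) {
  override def toString: String =
    s"[UnsafeRowFieldStatus] index: $index, expectedFieldType: $expectedFieldType, " +
      s"isNull: $isNull, isFixedLength: $isFixedLength, offset: $offset, size: $size"
}

case class RowStatus(expectedSchema: String, numFields: Int,
                     bitSetWidthInBytes: Int, rowSizeInBytes: Int,
                     fields: Seq[FieldStatus]) {
  override def toString: String =
    s"[UnsafeRowStatus] expectedSchema: $expectedSchema, numFields: $numFields, " +
      s"bitSetWidthInBytes: $bitSetWidthInBytes, rowSizeInBytes: $rowSizeInBytes\n" +
      "fieldStatus:\n" + fields.mkString("\n")
}
```

Rendering everything into one string keeps the whole diagnostic visible in a single exception message, which is the point of the PR.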
[GitHub] [spark] dongjoon-hyun opened a new pull request, #40072: [SPARK-42483][TESTS] Regenerate benchmark results
dongjoon-hyun opened a new pull request, #40072: URL: https://github.com/apache/spark/pull/40072

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110400229

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,7 +109,7 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToColumn(expr: String, inputExpr: proto.Expression): Column = {
```

Review Comment: I see what you are suggesting now. Done.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110388531

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -2065,6 +2065,15 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan:
     collectResult().iterator.asInstanceOf[java.util.Iterator[T]]
   }

+  /**
+   * Returns the number of rows in the Dataset.
+   * @group action
+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)
```

Review Comment: Well, you are right... I thought I added it.
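The `Dataset.count()` in the diff above piggybacks on the grouped `count()`: grouping by zero columns puts every row in a single group, so that group's `count` aggregate is the total row count. A plain-collections analogy of that idea (this is illustrative only, not Spark code):

```scala
// Plain-Scala analogy for Dataset.count() = groupBy().count():
// grouping by a constant key puts every element into one group,
// whose size is the total element count.
def countViaGroupBy[T](rows: Seq[T]): Long =
  rows
    .groupBy(_ => ())                              // one group for all rows
    .map { case (_, group) => group.size.toLong }  // per-group count
    .headOption
    .getOrElse(0L)                                 // empty input -> zero rows
```

In Spark the per-group count is computed distributedly and `collect().head.getLong(0)` pulls back the single resulting row.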
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110385162

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,7 +109,7 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToColumn(expr: String, inputExpr: proto.Expression): Column = {
```

Review Comment: How about:
```scala
private[this] def strToColumn(expr: String, inputExpr: Column): Column = {
  expr.toLowerCase(Locale.ROOT) match {
    case "avg" | "average" | "mean" => functions.avg(inputExpr)
    case "stddev" | "std" => functions.stddev(inputExpr)
    case "count" | "size" => functions.count(inputExpr) // Analyzer will take care of * expansion
    case name => Column.fn(name, inputExpr)
  }
}
```
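The core of the suggestion above is alias normalization: lowercase the user-supplied function name with `Locale.ROOT` (so matching is locale-independent) and fold the known aliases onto one canonical name before falling through to the function registry. That pattern can be shown standalone; the alias table below is a simplified assumption for illustration, not Spark's full registry:

```scala
import java.util.Locale

// Simplified, hypothetical alias table illustrating the matching pattern from
// the review suggestion. Locale.ROOT avoids locale-dependent lowercasing
// surprises (e.g. the Turkish dotless i). Unknown names fall through as-is,
// the way Spark defers them to its function registry.
def canonicalAggName(expr: String): String =
  expr.toLowerCase(Locale.ROOT) match {
    case "avg" | "average" | "mean" => "avg"
    case "stddev" | "std"           => "stddev"
    case "count" | "size"           => "count"
    case name                       => name
  }
```

With this shape, adding a new alias is a one-line change to the match, and everything that is not an alias takes the generic path.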
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110380640

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -149,4 +149,111 @@ class RelationalGroupedDataset protected[sql] (
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count")))
+
+  /**
+   * Compute the average value for each numeric columns for each group. This is an alias for
+   * `avg`. The resulting `DataFrame` will also contain the grouping columns. When specified
+   * columns are given, only compute the average values for them.
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def mean(colNames: String*): DataFrame = {
+    toDF(colNames.map(colName => functions.mean(colName)).toSeq)
```

Review Comment: hmmm I see. Removing those `toSeq`.
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435279944

If you don't mind, please allow me one or two days. I'll check this during the weekend~ Thank you for your patience always.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110378209

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
```

Review Comment: I think I did the right replacement and hit a proto -> plan test generation failure. I am planning to look into that separately. I will need some time to learn how to debug `org.apache.spark.sql.connect.ProtoToParsedPlanTestSuite`.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110376126

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -149,4 +149,111 @@ class RelationalGroupedDataset protected[sql] (
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count")))
+
+  /**
+   * Compute the average value for each numeric columns for each group. This is an alias for
+   * `avg`. The resulting `DataFrame` will also contain the grouping columns. When specified
+   * columns are given, only compute the average values for them.
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def mean(colNames: String*): DataFrame = {
+    toDF(colNames.map(colName => functions.mean(colName)).toSeq)
```

Review Comment: do we need `toSeq` here? I thought Scala varargs are always a Seq...
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110375466

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -2065,6 +2065,15 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan:
     collectResult().iterator.asInstanceOf[java.util.Iterator[T]]
  }

+  /**
+   * Returns the number of rows in the Dataset.
+   * @group action
+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)
```

Review Comment: In my local branch, which I rebased today, this API does not exist.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110374201

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:

```diff
@@ -1199,6 +1199,34 @@ class PlanGenerationTestSuite extends ConnectFunSuite with BeforeAndAfterAll wit
       "b" -> "avg",
       "*" -> "size",
       "a" -> "count")
+
+    simple
```

Review Comment: done.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110374514

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:

```diff
@@ -2065,6 +2065,15 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan:
     collectResult().iterator.asInstanceOf[java.util.Iterator[T]]
  }

+  /**
+   * Returns the number of rows in the Dataset.
+   * @group action
+   * @since 3.4.0
+   */
+  def count(): Long = {
+    groupBy().count().collect().head.getLong(0)
```

Review Comment: Didn't I implement that? lol...
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110372332

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
           .setIsDistinct(false)
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
```

Review Comment: Makes sense. Done.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110360552

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
           .setIsDistinct(false)
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c.expr
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))
```

Review Comment: I see what you mean now.
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110359583

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
           .setIsDistinct(false)
     }
     builder.build()
   }
+
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c.expr
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))
```

Review Comment: Wait, I think we should do it the reverse way, right? So `Dataset.count` = `groupBy().count().collect()`?
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110355264

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```diff
@@ -109,44 +109,131 @@ class RelationalGroupedDataset protected[sql] (
     agg(exprs.asScala.toMap)
   }

-  private[this] def strToExpr(expr: String, inputExpr: proto.Expression): proto.Expression = {
+  private[this] def strToExpr(expr: String, columnName: String): proto.Expression = {
     val builder = proto.Expression.newBuilder()
     expr.toLowerCase(Locale.ROOT) match {
       // We special handle a few cases that have alias that are not in function registry.
       case "avg" | "average" | "mean" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("avg")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.avg(columnName)
       case "stddev" | "std" =>
-        builder.getUnresolvedFunctionBuilder
-          .setFunctionName("stddev")
-          .addArguments(inputExpr)
-          .setIsDistinct(false)
+        functions.stddev(columnName)
       // Also special handle count because we need to take care count(*).
       case "count" | "size" =>
-        // Turn count(*) into count(1)
-        inputExpr match {
-          case s if s.hasUnresolvedStar =>
-            val exprBuilder = proto.Expression.newBuilder
-            exprBuilder.getLiteralBuilder.setInteger(1)
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(exprBuilder)
-              .setIsDistinct(false)
-          case _ =>
-            builder.getUnresolvedFunctionBuilder
-              .setFunctionName("count")
-              .addArguments(inputExpr)
-              .setIsDistinct(false)
-        }
+        functions.col(columnName)
       case name =>
         builder.getUnresolvedFunctionBuilder
           .setFunctionName(name)
-          .addArguments(inputExpr)
+          .addArguments(df(columnName).expr)
```

Review Comment: See my earlier comment; also, the tests are broken.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110354619

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, plus these additions):

```
+  /**
+   * Compute aggregates by specifying a series of aggregate columns. Note that this function by
+   * default retains the grouping columns in its output. To not retain grouping columns, set
+   * `spark.sql.retainGroupColumns` to false.
+   *
+   * The available aggregate methods are defined in [[org.apache.spark.sql.functions]].
+   *
+   * {{{
+   *   // Selects the age of the oldest employee and the aggregate expense for each department
+   *
+   *   // Scala:
+   *   import org.apache.spark.sql.functions._
+   *   df.groupBy("department").agg(max("age"), sum("expense"))
+   *
+   *   // Java:
+   *   import static org.apache.spark.sql.functions.*;
+   *   df.groupBy("department").agg(max("age"), sum("expense"));
+   * }}}
+   *
+   * Note that before Spark 1.4, the default behavior is to NOT retain grouping columns. To change
+   * to that behavior, set config variable `spark.sql.retainGroupColumns` to `false`.
+   * {{{
+   *   // Scala, 1.3.x:
+   *   df.groupBy("department").agg($"department", max("age"), sum("expense"))
+   *
+   *   // Java, 1.3.x:
+   *   df.groupBy("department").agg(col("department"), max("age"), sum("expense"));
+   * }}}
+   *
+   * @since 3.4.0
+   */
+  @scala.annotation.varargs
+  def agg(expr: Column, exprs: Column*): DataFrame = {
+    toDF((expr +: exprs).map { case c =>
+      c.expr
+      // TODO: deal with typed columns.
+    })
+  }
+
+  /**
+   * Count the number of rows for each group. The resulting `DataFrame` will also contain the
+   * grouping columns.
+   *
+   * @since 3.4.0
+   */
+  def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))
```

Review Comment (on the `count()` definition): You can replace the dataframe generation code there.
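The special-casing under discussion in `strToExpr` boils down to normalizing user-supplied aggregate names to canonical function names before building the unresolved-function message, with `count(*)` rewritten to `count(1)`. A minimal, self-contained sketch of just that normalization logic (the `AggAlias` class and its helpers are illustrative, not Spark API):

```java
import java.util.Locale;
import java.util.Map;

public class AggAlias {
    // Aliases that are not in the function registry, mirroring the match arms
    // in the diff: avg/average/mean -> avg, stddev/std -> stddev,
    // count/size -> count.
    private static final Map<String, String> ALIASES = Map.of(
        "average", "avg",
        "mean", "avg",
        "std", "stddev",
        "size", "count");

    /** Lower-cases the name and resolves known aliases to their canonical form. */
    static String normalize(String fn) {
        String lower = fn.toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(lower, lower);
    }

    /** Rewrites the argument of count(*) to the literal "1", as the old code did. */
    static String countArgument(String column) {
        return "*".equals(column) ? "1" : column;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Mean"));  // avg
        System.out.println(normalize("std"));   // stddev
        System.out.println(normalize("max"));   // max (falls through to the registry)
        System.out.println(countArgument("*")); // 1
    }
}
```

Any name not in the alias table falls through unchanged, matching the `case name =>` arm that forwards to the function registry.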
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110352767

## connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala:

```
@@ -1199,6 +1199,34 @@ class PlanGenerationTestSuite extends ConnectFunSuite with BeforeAndAfterAll wit
       "b" -> "avg",
       "*" -> "size",
       "a" -> "count")
+
+    simple
```

Review Comment: This doesn't work. You need to construct a single `Dataset` per test, now you are just returning the last one. Can you split this into a bunch of tests?
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110352041

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same hunks as above, on the line `def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))`):

Review Comment: That is different. `Dataset.count()` returns a long: `def count(): Long`
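The distinction amaliujia points out — `Dataset.count()` returning a single scalar versus the grouped `count()` returning one row per group — can be illustrated with plain Java collections (a toy model, no Spark involved):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountShapes {
    /** Like Dataset.count(): one scalar, i.e. SELECT COUNT(1) FROM df. */
    static long totalCount(List<String> departments) {
        return departments.size();
    }

    /**
     * Like groupBy("department").count(): one count per group, i.e.
     * SELECT department, COUNT(1) AS count FROM df GROUP BY department.
     */
    static Map<String, Long> perGroupCount(List<String> departments) {
        return departments.stream()
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> df = List.of("eng", "eng", "sales");
        System.out.println(totalCount(df));               // 3
        System.out.println(perGroupCount(df).get("eng")); // 2
    }
}
```

The different result shapes are why the two methods cannot share a signature: one returns `Long`, the other a `DataFrame` keyed by the grouping columns.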
[GitHub] [spark] amaliujia commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110349585

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, on the line `.addArguments(df(columnName).expr)`):

Review Comment: Hold on, I will revert this part. It seems to hit an issue somewhere when I switch to the functions API. I need to understand more of the functions implementation. I will debug this separately.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110350177

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same hunks as above, on the line `def count(): DataFrame = toDF(Seq(functions.count(functions.lit(1)).alias("count").expr))`):

Review Comment: Can you replace `Dataset.count()` with this code?
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110349605

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same hunks as above, on the new `agg(expr: Column, exprs: Column*)` definition and its `toDF((expr +: exprs).map { case c => c.expr })` body):

Review Comment: TBH I would just make `toDF` take `Column`s instead of `Expression`s, that saves you from doing conversions everywhere.
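hvanhovell's suggestion — have `toDF` accept `Column`s and do the `.expr` unwrapping in one place — is a general API-design move: push a repeated conversion into the callee so call sites stay clean. A hypothetical sketch of the pattern (the `Column`, `Expression`, and `toDf` names are stand-ins, not the actual Spark Connect classes):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ToDfSketch {
    // Stand-ins for proto.Expression and the Column wrapper around it.
    record Expression(String repr) {}
    record Column(Expression expr) {}

    // Before: every caller had to map Column -> Expression itself,
    // scattering ".expr" conversions across the codebase.
    static List<Expression> toDfFromExprs(List<Expression> exprs) {
        return exprs;
    }

    // After: toDf takes Columns and unwraps them once, internally.
    static List<Expression> toDf(Column... cols) {
        return Arrays.stream(cols).map(Column::expr).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Column max = new Column(new Expression("max(age)"));
        Column sum = new Column(new Expression("sum(expense)"));
        System.out.println(toDf(max, sum).size()); // 2
    }
}
```

With the second shape, `agg` could simply forward its varargs without mapping over them first.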
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110347846

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, on the line `functions.avg(columnName)`):

Review Comment: You currently don't return this function, but the result of builder.build(). If you do, it should be `functions.avg(columnName).expr`.
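The bug hvanhovell flags is a classic one: the match arms now construct a value (`functions.avg(columnName)`) but the method still returns `builder.build()`, so the constructed value is silently discarded. A distilled illustration of the pattern (hypothetical names, not Spark code):

```java
public class DiscardedResult {
    /** Buggy shape: the branch computes a result, but the return ignores it. */
    static String buggyNormalize(String fn) {
        StringBuilder builder = new StringBuilder();
        if (fn.equals("mean")) {
            String computed = "avg"; // computed but never returned -- the bug
        } else {
            builder.append(fn);
        }
        return builder.toString(); // "" for "mean": the branch result was dropped
    }

    /** Fixed shape: the conditional itself is the returned expression. */
    static String fixedNormalize(String fn) {
        return fn.equals("mean") ? "avg" : fn;
    }

    public static void main(String[] args) {
        System.out.println(buggyNormalize("mean").isEmpty()); // true -- silently wrong
        System.out.println(fixedNormalize("mean"));           // avg
    }
}
```

In Scala the whole `match` is an expression, so the fix is to make every arm yield the proto expression (e.g. `functions.avg(columnName).expr`) and return the match result instead of the stale builder.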
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110348196

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala (quoting the same `strToExpr` hunk as above, on the line `.addArguments(df(columnName).expr)`):

Review Comment: Use `Column.fn` instead?
[GitHub] [spark] ueshin opened a new pull request, #40071: [SPARK-41818][CONNECT][PYTHON][FOLLOWUP][TEST] Enable a doctest for DataFrame.write
ueshin opened a new pull request, #40071: URL: https://github.com/apache/spark/pull/40071

### What changes were proposed in this pull request?
Enables a doctest for `DataFrame.write`.

### Why are the changes needed?
Now that `DataFrame.write.saveAsTable` was fixed, we can enable the doctest.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enabled the doctest.
[GitHub] [spark] amaliujia commented on pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on PR #40070: URL: https://github.com/apache/spark/pull/40070#issuecomment-1435257024 I need to update tests in this PR.
[GitHub] [spark] hvanhovell commented on a diff in pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
hvanhovell commented on code in PR #40070: URL: https://github.com/apache/spark/pull/40070#discussion_r1110346693

## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala:

```
@@ -24,7 +24,7 @@ import scala.collection.JavaConverters._
 import org.apache.spark.connect.proto

 /**
- * A set of methods for aggregations on a `DataFrame`, created by [[Dataset#groupBy groupBy]],
+ * A set of methods for aggregations on a `DataFrame`, creted by [[Dataset#groupBy groupBy]],
```

Review Comment: Created?
[GitHub] [spark] amaliujia commented on pull request #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia commented on PR #40070: URL: https://github.com/apache/spark/pull/40070#issuecomment-1435241324 @hvanhovell
[GitHub] [spark] amaliujia opened a new pull request, #40070: [SPARK-42481][CONNECT] Implement agg.{max,min,mean,count,avg,sum}
amaliujia opened a new pull request, #40070: URL: https://github.com/apache/spark/pull/40070

### What changes were proposed in this pull request?
Adding more APIs to `agg`, including max, min, mean, count, avg, and sum.

### Why are the changes needed?
API coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT
[GitHub] [spark] dtenedor commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
dtenedor commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110299640

## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java (new file; standard Apache license header omitted):

```
+package org.apache.spark.sql.connector.catalog;
+
+import java.util.Objects;
+import javax.annotation.Nonnull;
+
+import org.apache.spark.sql.connector.expressions.Literal;
+
+/**
+ * A class representing the default value of a column. It contains both the SQL string and literal
+ * value of the user-specified default value expression. The SQL string should be re-evaluated for
+ * each table writing command, which may produce different values if the default value expression is
+ * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if
+ * new columns with default value are added. Note: the back-fill can be lazy. The data sources can
+ * remember the column default value and let the reader fill the column value when reading existing
+ * data that do not have these new columns.
+ */
+public class ColumnDefaultValue {
+  private String sql;
+  private Literal value;
```

Review Comment: Data source developers only have to think about the existence default value. For any column where the corresponding field is not present in storage, the data source is responsible for filling this in instead of NULL. On the other hand, the current default value is for DML only. The analyzer inserts this expression for any explicit reference to `DEFAULT`, or for a small subset of implicit cases. For these fields we could clarify with comments, e.g.

```
// This is the original string contents of the SQL expression specified at the
// time the column was created in a CREATE TABLE, REPLACE TABLE, or ALTER TABLE
// ADD COLUMN command. For example, for "CREATE TABLE t (col INT DEFAULT 42)",
// this field is equal to the string literal "42" (without quotation marks).
private String sql;

// This is the literal value corresponding to the above SQL string. For the above
// example, this would be a literal integer with a value of 42.
private Literal value;
```
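The lazy back-fill described in the javadoc — the data source remembers the column's existence default and the reader substitutes it for rows written before the column was added — can be sketched with a toy model (the row representation and `read` helper are illustrative, not the DS v2 API):

```java
import java.util.HashMap;
import java.util.Map;

public class ExistenceDefaultSketch {
    // Per-column existence defaults remembered by the data source,
    // e.g. "col" -> 42 after ALTER TABLE t ADD COLUMN col INT DEFAULT 42.
    static final Map<String, Object> EXISTENCE_DEFAULTS = Map.<String, Object>of("col", 42);

    /**
     * Reads a field from a stored row. Rows written before the column was
     * added have no entry for it; instead of surfacing null, the reader
     * fills in the remembered default value on the fly.
     */
    static Object read(Map<String, ?> storedRow, String column) {
        if (storedRow.containsKey(column)) {
            return storedRow.get(column);
        }
        return EXISTENCE_DEFAULTS.get(column); // lazy back-fill at read time
    }

    public static void main(String[] args) {
        Map<String, Object> oldRow = new HashMap<>();  // written before ADD COLUMN
        Map<String, Object> newRow = Map.of("col", 7); // written after
        System.out.println(read(oldRow, "col")); // 42
        System.out.println(read(newRow, "col")); // 7
    }
}
```

This is the "existence default" half of the contract; the "current default" half is handled entirely by the analyzer at DML time and never reaches the data source.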
[GitHub] [spark] amaliujia commented on a diff in pull request #40054: [SPARK-42477] [CONNECT] [PYTHON]: accept user_agent in spark connect's connection string
amaliujia commented on code in PR #40054: URL: https://github.com/apache/spark/pull/40054#discussion_r1110286421 ## connector/connect/docs/client-connection-string.md: ## @@ -81,6 +82,15 @@ sc://hostname:port/;param1=value;param2=value user_id=Martin + +user_agent +String +The user agent acting on behalf of the user, typically applications +that use Spark Connect to implement its functionality and execute Spark +requests on behalf of the user. +Default: _SPARK_CONNECT_PYTHON in the Python client Review Comment: I see that this is changed to default value in Python client. This is fine. Probably in the future we just list the default value for the Scala client.
[GitHub] [spark] amaliujia commented on a diff in pull request #40054: [SPARK-42477] [CONNECT] [PYTHON]: accept user_agent in spark connect's connection string
amaliujia commented on code in PR #40054: URL: https://github.com/apache/spark/pull/40054#discussion_r1110284777 ## connector/connect/docs/client-connection-string.md: ## @@ -81,6 +82,15 @@ sc://hostname:port/;param1=value;param2=value user_id=Martin + +user_agent +String +The user agent acting on behalf of the user, typically applications +that use Spark Connect to implement its functionality and execute Spark +requests on behalf of the user. +Default: _SPARK_CONNECT_PYTHON in the Python client Review Comment: Wait, I thought this was addressed: the default value is empty, right? `_SPARK_CONNECT_PYTHON` is the value set by the Python client, and the Scala client, for example, will set a different value.
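The connection string under discussion carries its parameters as `;`-separated `key=value` pairs after the path, e.g. `sc://hostname:port/;user_id=Martin;user_agent=...`. A minimal sketch of extracting those parameters follows; the parsing is purely illustrative and is not the actual Spark Connect client implementation.

```python
# Illustrative parser for a Spark Connect-style connection string; not the
# real client code. Everything after "/;" is a ";"-separated list of
# key=value pairs such as user_id and user_agent from the doc above.
from typing import Dict

def parse_params(url: str) -> Dict[str, str]:
    # Split off the scheme, host:port, and path; keep the parameter list.
    _, _, rest = url.partition("/;")
    params: Dict[str, str] = {}
    for pair in filter(None, rest.split(";")):
        key, _, value = pair.partition("=")
        params[key] = value
    return params

params = parse_params("sc://hostname:15002/;user_id=Martin;user_agent=my_app")
print(params["user_agent"])  # my_app
```

With no parameter list present, the sketch simply returns an empty dict, which matches the "default is empty unless the client sets it" behavior being discussed.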
[GitHub] [spark] LuciferYang commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
LuciferYang commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110257681 ## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala: ## @@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField, import org.apache.spark.unsafe.types.UTF8String private[sql] object PartitioningUtils { + private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r Review Comment: So if the data corresponding to the partition `a=1` is stored in dir `/1/`, will there be a bad case with this PR?
[GitHub] [spark] LuciferYang commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
LuciferYang commented on code in PR #40069: URL: https://github.com/apache/spark/pull/40069#discussion_r1110243153 ## sql/catalyst/src/main/scala/org/apache/spark/sql/util/PartitioningUtils.scala: ## @@ -30,6 +30,7 @@ import org.apache.spark.sql.types.{CharType, DataType, StringType, StructField, import org.apache.spark.unsafe.types.UTF8String private[sql] object PartitioningUtils { + private val PATTERN_FOR_KEY_EQ_VAL = "(.+)=(.+)".r Review Comment: Too idealistic: not all partition tables follow this rule. For example, we can use `alter table ... partition(...) set location ...` to relocate the partition to any directory.
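The concern in this review thread can be demonstrated: a `(.+)=(.+)` pattern recovers a key/value pair only from path segments that follow the `key=value` layout, and recovers nothing from a relocated partition directory like `/1/`. A small sketch, using Python's `re` in place of Scala's `"(.+)=(.+)".r` purely for illustration:

```python
# Sketch of the key=value partition-path assumption discussed above,
# using Python's re as a stand-in for Scala's "(.+)=(.+)".r.
import re

PATTERN_FOR_KEY_EQ_VAL = re.compile(r"(.+)=(.+)")

def parse_segment(segment: str):
    """Return (key, value) if the path segment looks like key=value,
    else None (e.g. a partition relocated via SET LOCATION)."""
    m = PATTERN_FOR_KEY_EQ_VAL.fullmatch(segment)
    return (m.group(1), m.group(2)) if m else None

print(parse_segment("a=1"))  # ('a', '1')
print(parse_segment("1"))    # None: a relocated dir like /1/ carries no key
```

This is exactly the bad case the reviewers raise: once a partition location no longer encodes the key, nothing in the path can be matched back to the partition spec.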
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435129166 Yeah, Spark 3.4.0 does not need this PR.
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435116865 I'm trying to assess the issue. So, that combination issue is not an AS-IS Apache Spark issue in either master or branch-3.4, right? FYI, the CycloneDX plugin 2.7.4 issue is a known one. When I started the SBOM work, 2.7.4 was the latest but was unusable across multiple ASF projects. That was the main reason I chose 2.7.3 instead of the latest at that time. I'm not quite sure if 2.7.5 is stable enough. Anyway, we can apply this PR on the `master` branch for Apache Spark 3.5.0 only, separately from the Maven issue. Maven is always another big issue.
[GitHub] [spark] rehevkor5 commented on a diff in pull request #29591: [SPARK-32714][PYTHON] Initial pyspark-stubs port.
rehevkor5 commented on code in PR #29591: URL: https://github.com/apache/spark/pull/29591#discussion_r1110230025 ## python/pyspark/sql/dataframe.pyi: ## @@ -0,0 +1,324 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from typing import overload +from typing import ( +Any, +Callable, +Dict, +Iterator, +List, +Optional, +Tuple, +Union, +) + +from py4j.java_gateway import JavaObject # type: ignore[import] + +from pyspark.sql._typing import ColumnOrName, LiteralType, OptionalPrimitiveType +from pyspark.sql.types import ( # noqa: F401 +StructType, +StructField, +StringType, +IntegerType, +Row, +) # noqa: F401 +from pyspark.sql.context import SQLContext +from pyspark.sql.group import GroupedData +from pyspark.sql.readwriter import DataFrameWriter, DataFrameWriterV2 +from pyspark.sql.streaming import DataStreamWriter +from pyspark.sql.column import Column +from pyspark.rdd import RDD +from pyspark.storagelevel import StorageLevel + +from pyspark.sql.pandas.conversion import PandasConversionMixin +from pyspark.sql.pandas.map_ops import PandasMapOpsMixin + +class DataFrame(PandasMapOpsMixin, PandasConversionMixin): +sql_ctx: SQLContext +is_cached: bool +def __init__(self, jdf: JavaObject, sql_ctx: SQLContext) -> None: ... 
+@property +def rdd(self) -> RDD[Row]: ... +@property +def na(self) -> DataFrameNaFunctions: ... +@property +def stat(self) -> DataFrameStatFunctions: ... +def toJSON(self, use_unicode: bool = ...) -> RDD[str]: ... +def registerTempTable(self, name: str) -> None: ... +def createTempView(self, name: str) -> None: ... +def createOrReplaceTempView(self, name: str) -> None: ... +def createGlobalTempView(self, name: str) -> None: ... +@property +def write(self) -> DataFrameWriter: ... +@property +def writeStream(self) -> DataStreamWriter: ... +@property +def schema(self) -> StructType: ... +def printSchema(self) -> None: ... +def explain( +self, extended: Optional[Union[bool, str]] = ..., mode: Optional[str] = ... +) -> None: ... +def exceptAll(self, other: DataFrame) -> DataFrame: ... +def isLocal(self) -> bool: ... +@property +def isStreaming(self) -> bool: ... +def show( +self, n: int = ..., truncate: Union[bool, int] = ..., vertical: bool = ... +) -> None: ... +def checkpoint(self, eager: bool = ...) -> DataFrame: ... +def localCheckpoint(self, eager: bool = ...) -> DataFrame: ... +def withWatermark( +self, eventTime: ColumnOrName, delayThreshold: str +) -> DataFrame: ... +def hint(self, name: str, *parameters: Any) -> DataFrame: ... +def count(self) -> int: ... +def collect(self) -> List[Row]: ... +def toLocalIterator(self, prefetchPartitions: bool = ...) -> Iterator[Row]: ... +def limit(self, num: int) -> DataFrame: ... +def take(self, num: int) -> List[Row]: ... +def tail(self, num: int) -> List[Row]: ... +def foreach(self, f: Callable[[Row], None]) -> None: ... +def foreachPartition(self, f: Callable[[Iterator[Row]], None]) -> None: ... Review Comment: Shouldn't this be `Iterable[Row]` instead of `Iterator[Row]`, to match https://github.com/apache/spark/pull/29591/files#diff-6349afe05d41878cc15995c96a14b011d6aef04b779e136f711eab989b71da6cR215 ?
[GitHub] [spark] rehevkor5 commented on a diff in pull request #34225: [SPARK-36885][PYTHON] Inline type hints for pyspark.sql.dataframe
rehevkor5 commented on code in PR #34225: URL: https://github.com/apache/spark/pull/34225#discussion_r1110226166 ## python/pyspark/sql/dataframe.py: ## @@ -784,7 +824,7 @@ def foreach(self, f): """ self.rdd.foreach(f) -def foreachPartition(self, f): +def foreachPartition(self, f: Callable[[Iterator[Row]], None]) -> None: Review Comment: Shouldn't this be `Iterable[Row]` instead of `Iterator[Row]`, so that it matches https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1750 ?
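The distinction this thread keeps raising can be demonstrated outside Spark: every iterator is also an iterable, so a hint of `Iterable[Row]` accepts strictly more callables than `Iterator[Row]`, while the object actually handed to the callback behaves like a one-pass iterator either way. A minimal sketch with a stand-in `Row` and a stand-in for `foreachPartition` (not Spark's implementation):

```python
# Sketch of the Iterator-vs-Iterable question above, without Spark.
# foreach_partition is a stand-in for DataFrame.foreachPartition: it hands
# each partition's rows to f as a lazy, one-pass iterator.
from typing import Callable, Iterator, List

Row = dict  # stand-in for pyspark.sql.Row

def foreach_partition(partitions: List[List[Row]],
                      f: Callable[[Iterator[Row]], None]) -> None:
    for part in partitions:
        f(iter(part))  # each partition is passed as an iterator

seen: List[Row] = []

def collect(rows: Iterator[Row]) -> None:
    # Works under either hint; the rows can only be consumed once.
    seen.extend(rows)

foreach_partition([[{"id": 1}], [{"id": 2}]], collect)
print(seen)  # [{'id': 1}, {'id': 2}]
```

Whether the public hint should say `Iterator` or `Iterable` is therefore a contract question (what callers may assume about re-iteration), not a runtime one.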
[GitHub] [spark] gengliangwang commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
gengliangwang commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110209562 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.catalog; + +import java.util.Objects; +import javax.annotation.Nonnull; + +import org.apache.spark.sql.connector.expressions.Literal; + +/** + * A class representing the default value of a column. It contains both the SQL string and literal + * value of the user-specified default value expression. The SQL string should be re-evaluated for + * each table writing command, which may produce different values if the default value expression is + * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if + * new columns with default value are added. Note: the back-fill can be lazy. The data sources can + * remember the column default value and let the reader fill the column value when reading existing + * data that do not have these new columns. 
+ */ +public class ColumnDefaultValue { + private String sql; + private Literal value; Review Comment: Actually, I have read the classdoc before commenting... I don't have a better suggestion. Let's enhance the doc later.
[GitHub] [spark] wecharyu opened a new pull request, #40069: [SPARK-42480][SQL] Improve the performance of drop partitions
wecharyu opened a new pull request, #40069: URL: https://github.com/apache/spark/pull/40069

### What changes were proposed in this pull request?
1. Change to get matching partition names rather than partition objects when dropping partitions.

### Why are the changes needed?
1. Partition names are enough to drop partitions.
2. It can reduce the time overhead and driver memory overhead.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passing the existing tests.
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435050161 Please let me explain my intention more:

1. First, I wanted to update Maven to 3.9.0 (keeping CycloneDX 2.7.3), but I found the following error:
```
[ERROR] An error occurred attempting to read POM org.codehaus.plexus.util.xml.pull.XmlPullParserException: UTF-8 BOM plus xml decl of ISO-8859-1 is incompatible (position: START_DOCUMENT seen
```
We can check https://github.com/LuciferYang/spark/actions/runs/4206035140/jobs/7299042843 later.

2. Then I tested Maven 3.9.0 + CycloneDX 2.7.4 a couple of days ago, but hit the same error as with `maven 3.8.7 + cyclonedx-maven-plugin 2.7.4`. I think we can see it here later: https://github.com/LuciferYang/spark/runs/11424487074

3. Then I tested Maven 3.9.0 + CycloneDX 2.7.5 today, and there were none of the above issues (we can check https://github.com/LuciferYang/spark/runs/11424568023 later).

If I want to upgrade Spark to use Maven 3.9.0, I must upgrade cyclonedx-maven-plugin to 2.7.5, so should I upgrade them in one PR at the same time?
[GitHub] [spark] LuciferYang commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
LuciferYang commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435045931 Yes, we use CycloneDX 2.7.3. So I should not explain that 2.7.4 has such an issue in the PR description, because it does not affect Spark now, am I right?
[GitHub] [spark] cloud-fan commented on a diff in pull request #39942: [SPARK-42398][SQL] Refine default column value framework
cloud-fan commented on code in PR #39942: URL: https://github.com/apache/spark/pull/39942#discussion_r1110153301 ## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ## @@ -3108,7 +3108,7 @@ object SQLConf { "provided values when the corresponding fields are not present in storage.") .version("3.4.0") .stringConf - .createWithDefault("csv,json,orc,parquet") + .createWithDefault("csv,json,orc,parquet,hive") Review Comment: ```suggestion .createWithDefault("csv,json,orc,parquet") ```
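The config being debated here is a comma-separated allowlist of providers that honor default column values, so deciding whether a given source participates amounts to a membership test. A hedged sketch of that check (the function name is illustrative, not Spark's internal helper):

```python
# Illustrative check against a comma-separated provider allowlist like the
# SQLConf default above; not Spark's actual helper.
def default_columns_enabled(provider: str, allowlist: str) -> bool:
    """Return True if provider appears in the comma-separated allowlist,
    matching case-insensitively and ignoring surrounding whitespace."""
    providers = {p.strip().lower() for p in allowlist.split(",") if p.strip()}
    return provider.lower() in providers

conf = "csv,json,orc,parquet"
print(default_columns_enabled("Parquet", conf))  # True
print(default_columns_enabled("hive", conf))     # False
```

Under the suggested revert, `hive` would fall outside the allowlist, which is exactly what the second output line shows.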
[GitHub] [spark] cloud-fan commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
cloud-fan commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110150524 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.catalog; + +import java.util.Objects; +import javax.annotation.Nonnull; + +import org.apache.spark.sql.connector.expressions.Literal; + +/** + * A class representing the default value of a column. It contains both the SQL string and literal + * value of the user-specified default value expression. The SQL string should be re-evaluated for + * each table writing command, which may produce different values if the default value expression is + * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if + * new columns with default value are added. Note: the back-fill can be lazy. The data sources can + * remember the column default value and let the reader fill the column value when reading existing + * data that do not have these new columns. 
+ */ +public class ColumnDefaultValue { + private String sql; + private Literal value; Review Comment: Can you also read the classdoc? If you still think the name is confusing, let's figure out a better one.
[GitHub] [spark] cloud-fan commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
cloud-fan commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110146649 ## sql/core/src/test/scala/org/apache/spark/sql/connector/TestV2SessionCatalogBase.scala: ## @@ -64,6 +64,15 @@ private[connector] trait TestV2SessionCatalogBase[T <: Table] extends Delegating } } + override def createTable( + ident: Identifier, + columns: Array[Column], + partitions: Array[Transform], + properties: java.util.Map[String, String]): Table = { +createTable(ident, CatalogV2Util.v2ColumnsToStructType(columns), partitions, properties) + } + + // TODO: remove it when no tests calling this deprecated method. Review Comment: I haven't created it. This is a test-only change so the priority is low.
[GitHub] [spark] cloud-fan commented on a diff in pull request #40049: [SPARK-42398][SQL] Refine default column value DS v2 interface
cloud-fan commented on code in PR #40049: URL: https://github.com/apache/spark/pull/40049#discussion_r1110135070 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ColumnDefaultValue.java: ## @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.catalog; + +import java.util.Objects; +import javax.annotation.Nonnull; + +import org.apache.spark.sql.connector.expressions.Literal; + +/** + * A class representing the default value of a column. It contains both the SQL string and literal + * value of the user-specified default value expression. The SQL string should be re-evaluated for + * each table writing command, which may produce different values if the default value expression is + * something like {@code CURRENT_DATE()}. The literal value is used to back-fill existing data if + * new columns with default value are added. Note: the back-fill can be lazy. The data sources can + * remember the column default value and let the reader fill the column value when reading existing + * data that do not have these new columns. 
+ */ +public class ColumnDefaultValue { + private String sql; + private Literal value; Review Comment: A default value has two parts: the SQL string and the evaluated literal value. I don't think "current default" and "exist default" are easier to understand for data source developers.
[GitHub] [spark] dongjoon-hyun commented on pull request #40065: [SPARK-42382][BUILD] Upgrade `cyclonedx-maven-plugin` to 2.7.5
dongjoon-hyun commented on PR #40065: URL: https://github.com/apache/spark/pull/40065#issuecomment-1435007566 I mean in our GitHub Action repo. We are using CycloneDX 2.7.3, aren't we? > I make another one build with maven 3.8.7 + cyclonedx-maven-plugin 2.7.4 https://github.com/LuciferYang/spark/actions/runs/4205904014/jobs/7298678641