[GitHub] [spark] LuciferYang commented on pull request #38075: [WIP][SPARK-40633][BUILD] Upgrade janino to 3.1.8
LuciferYang commented on PR #38075: URL: https://github.com/apache/spark/pull/38075#issuecomment-1321605889 will check 3.1.9 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY
cloud-fan commented on code in PR #38722: URL: https://github.com/apache/spark/pull/38722#discussion_r1027662257 ## core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java: ## @@ -812,9 +812,7 @@ public boolean append(Object kbase, long koff, int klen, Object vbase, long voff // If the map has reached its growth threshold, try to grow it. if (numKeys >= growthThreshold) { - // We use two array entries per key, so the array size is twice the capacity. - // We should compare the current capacity of the array, instead of its size. - if (longArray.size() / 2 < MAX_CAPACITY) { + if (longArray.size() < MAX_CAPACITY) { Review Comment: I think we need to read the code of this entire class to understand the capacity. According to the code comment ``` /** * The maximum number of keys that BytesToBytesMap supports. The hash table has to be * power-of-2-sized and its backing Java array can contain at most (1 30) elements, * since that's the largest power-of-2 that's less than Integer.MAX_VALUE. We need two long array * entries per key, giving us a maximum capacity of (1 29). */ public static final int MAX_CAPACITY = (1 << 29); ``` This is to restrict the number of keys, and `longArray.size() / 2` is the number of keys. The code here should be correct. I also saw code like this ``` private void allocate(int capacity) { assert (capacity >= 0); capacity = Math.max((int) Math.min(MAX_CAPACITY, ByteArrayMethods.nextPowerOf2(capacity)), 64); assert (capacity <= MAX_CAPACITY); longArray = allocateArray(capacity * 2L); longArray.zeroOut(); this.growthThreshold = (int) (capacity * loadFactor); this.mask = capacity - 1; } ``` `longArray` is 2x capacity. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38711: [SPARK-41192][Core] Remove unscheduled speculative tasks when task finished to obtain better dynamic
mridulm commented on PR #38711: URL: https://github.com/apache/spark/pull/38711#issuecomment-1321576092 Thanks for the PR to fix this bug ! I will need to think more about this issue, but I am leaning towards a variant of the solution proposed. Namely: * `stageAttemptToNumSpeculativeTasks` -> `stageAttemptToUnsubmittedSpeculativeTasks` * These are task indices which have been marked speculatable, which have not yet been launched. * When a speculative task starts, remove it from `stageAttemptToUnsubmittedSpeculativeTasks` * When a non-speculative task finishes, remove it from `stageAttemptToUnsubmittedSpeculativeTasks` Thoughts @Ngone51, @toujours33, @LuciferYang ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38495: [SPARK-35531][SQL] Update hive table stats without unnecessary convert
cloud-fan commented on code in PR #38495: URL: https://github.com/apache/spark/pull/38495#discussion_r1027656667 ## sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala: ## @@ -894,12 +895,14 @@ class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter sql(insertString.toLowerCase(Locale.ROOT)) sql(insertString.toUpperCase(Locale.ROOT)) + spark.sessionState.catalog.alterTableStats(TableIdentifier("test1"), None) Review Comment: Does it test anything? It just invokes `alterTableStats` but does no verification. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist
cloud-fan commented on code in PR #38703: URL: https://github.com/apache/spark/pull/38703#discussion_r1027651034 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala: ## @@ -355,7 +355,7 @@ case class ListQuery( plan.canonicalized, outerAttrs.map(_.canonicalized), ExprId(0), - childOutputs.map(_.canonicalized.asInstanceOf[Attribute]), + plan.canonicalized.output, Review Comment: This looks correct, but I don't know how is this related to the cache problem. Can you elaborate? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
cloud-fan commented on PR #38687: URL: https://github.com/apache/spark/pull/38687#issuecomment-1321552064 Oh wait a minute. Due to the spark catalog extension (set via `spark.sql.catalog.spark_catalog`), we can have tables supporting time travel in the v1 catalog as well. I think `SessionCatalog.tableRelationCache` also need a fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
mridulm commented on PR #38560: URL: https://github.com/apache/spark/pull/38560#issuecomment-1321551358 @yabola, there is quite a lot of nontrivial overlap between this PR and @wankunde's PR is trying doing. Would be great if you both can coordinate on this - I would love to get this functionality merged before we starting getting closer to code freeze for 3.4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
cloud-fan commented on PR #38687: URL: https://github.com/apache/spark/pull/38687#issuecomment-1321550782 thanks, merging to master/3.3! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38560: [WIP][SPARK-38005][core] Support cleaning up merged shuffle files and state from external shuffle service
mridulm commented on PR #38560: URL: https://github.com/apache/spark/pull/38560#issuecomment-1321548579 > One things that I know need to be addressed are: > Some merge data infos are not saved on the driver because they are too small ( controlled by spark.shuffle.push.minShuffleSizeToWait) please see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L2295 In this case, we should fire a remove immediately - we are not going to use it for this app anyway ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] gengliangwang commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
gengliangwang commented on code in PR #38567: URL: https://github.com/apache/spark/pull/38567#discussion_r1027640551 ## core/src/main/scala/org/apache/spark/status/AppStatusStore.scala: ## @@ -769,7 +772,14 @@ private[spark] object AppStatusStore { def createLiveStore( conf: SparkConf, appStatusSource: Option[AppStatusSource] = None): AppStatusStore = { -val store = new ElementTrackingStore(new InMemoryStore(), conf) +val storePath = conf.get(LIVE_UI_LOCAL_STORE_DIR).map(new File(_)) +// For the disk-based KV store of live UI, let's simply make it ROCKSDB only for now, +// instead of supporting both LevelDB and RocksDB. RocksDB is built based on LevelDB with +// improvements on writes and reads. Furthermore, we can reuse the RocksDBFileManager in +// streaming for replicating the local RocksDB file to DFS. The replication in DFS can be +// used for Spark history server. Review Comment: Thanks, I have removed the comment about replication. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished
mridulm commented on code in PR #37922: URL: https://github.com/apache/spark/pull/37922#discussion_r1027637860 ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java: ## @@ -393,6 +394,35 @@ public void applicationRemoved(String appId, boolean cleanupLocalDirs) { } } + @Override + public void removeShuffleMerge(RemoveShuffleMerge msg) { +String appId = msg.appId; +int appAttemptId = msg.appAttemptId; +int shuffleId = msg.shuffleId; +AppShuffleInfo appShuffleInfo = validateAndGetAppShuffleInfo(appId); +if (appShuffleInfo.attemptId != appAttemptId) { + throw new IllegalArgumentException( + String.format("The attempt id %s in this RemoveShuffleMerge message does not match " + + "with the current attempt id %s stored in shuffle service for application %s", + appAttemptId, appShuffleInfo.attemptId, appId)); +} + +appShuffleInfo.shuffles.compute(shuffleId, (shuffleIdKey, partitionsInfo) -> { + if (null != partitionsInfo) { Review Comment: The validation related to `shuffleMergeId`, from `finalizeShuffleMerge`, is applicable here depending on `shuffleMergeId`. I have given the details [here](https://github.com/apache/spark/pull/37922#discussion_r990753031) - please refer to it. ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/NoOpMergedShuffleFileManager.java: ## @@ -84,4 +85,9 @@ public MergedBlockMeta getMergedBlockMeta( public String[] getMergedBlockDirs(String appId) { throw new UnsupportedOperationException("Cannot handle shuffle block merge"); } + + @Override + public void removeShuffleMerge(RemoveShuffleMerge removeShuffleMerge) { +throw new UnsupportedOperationException("Cannot handle shuffle block merge"); Review Comment: nit: ```suggestion throw new UnsupportedOperationException("Cannot handle merged shuffle remove"); ``` ## common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/MergedShuffleFileManager.java: ## @@ -121,6 +122,13 @@ MergedBlockMeta getMergedBlockMeta( */ String[] getMergedBlockDirs(String appId); + /** + * Remove shuffle merge data files. + * + * @param removeShuffleMerge Remove shuffle merge RPC Review Comment: ```suggestion * @param removeShuffleMerge contains shuffle details (appId, shuffleId, etc) to uniquely identify a shuffle to be removed ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang opened a new pull request, #38737: [SPARK-41174][SQL] Propagate an error class to users for invalid `format` of `to_binary()`
LuciferYang opened a new pull request, #38737: URL: https://github.com/apache/spark/pull/38737 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] erenavsarogullari commented on pull request #38736: [WIP][SPARK-41214][SQL] - SubPlan metrics are missed when AQE is enabled under InMemoryRelation
erenavsarogullari commented on PR #38736: URL: https://github.com/apache/spark/pull/38736#issuecomment-1321545696 cc @cloud-fan @Ngone51 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] erenavsarogullari opened a new pull request, #38736: [WIP][SPARK-41214][SQL] - SubPlan metrics under InMemoryRelation are missed when …
erenavsarogullari opened a new pull request, #38736: URL: https://github.com/apache/spark/pull/38736 ### What changes were proposed in this pull request? `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` enables AQE optimizations under `InMemoryRelation`(IMR) nodes by creating separated sub-SparkPlans for AQE optimizations. However, when `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true`, Spark UI does not show correct DAG due to lack of final sub-plans (under IMR) submissions (into UI). **DAG before fix:** https://user-images.githubusercontent.com/1437738/202984370-c179707a-c091-4133-adb6-d5009c98875a.png;> **DAG after fix:** https://user-images.githubusercontent.com/1437738/202984481-a63ba5e2-fc66-4dc7-98a0-f2233a93e5c8.png;> ### Why are the changes needed? `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning` enables AQE optimizations under `InMemoryRelation`(IMR) nodes. Following sample query has IMR node on both `BroadcastHashJoin` legs. However, when `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true`, following datas are missed due to lack of final sub-plans (under IMR) submissions (into UI). ### Does this PR introduce _any_ user-facing change? Currently, Spark UI does not show final DAG when `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning = true` so this causes some physical operator metrics as missed. This PR aims to fix this problem. ### How was this patch tested? Specific UT Coverage will be added (In Progress). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #38713: [SPARK-41195][SQL] Support PIVOT/UNPIVOT with join children
cloud-fan commented on code in PR #38713: URL: https://github.com/apache/spark/pull/38713#discussion_r1027630891 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1263,60 +1263,71 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit * }}} */ override def visitRelation(ctx: RelationContext): LogicalPlan = withOrigin(ctx) { -withJoinRelations(plan(ctx.relationPrimary), ctx) +withRelationExtensions(ctx, plan(ctx.relationPrimary)) + } + + private def withRelationExtensions(ctx: RelationContext, query: LogicalPlan): LogicalPlan = { +ctx.relationExtension().asScala.foldLeft(query) { (left, extension) => + if (extension.joinRelation() != null) { +withJoinRelation(extension.joinRelation(), left) + } else if (extension.pivotClause() != null) { +withPivot(extension.pivotClause(), left) + } else { +assert(extension.unpivotClause() != null) +withUnpivot(extension.unpivotClause(), left) + } +} } /** - * Join one more [[LogicalPlan]]s to the current logical plan. + * Join one more [[LogicalPlan]] to the current logical plan. */ - private def withJoinRelations(base: LogicalPlan, ctx: RelationContext): LogicalPlan = { -ctx.joinRelation.asScala.foldLeft(base) { (left, join) => Review Comment: The actually code change is very small, just remove this loop and rename a few variables. ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala: ## @@ -1263,60 +1263,71 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit * }}} */ override def visitRelation(ctx: RelationContext): LogicalPlan = withOrigin(ctx) { -withJoinRelations(plan(ctx.relationPrimary), ctx) +withRelationExtensions(ctx, plan(ctx.relationPrimary)) + } + + private def withRelationExtensions(ctx: RelationContext, query: LogicalPlan): LogicalPlan = { +ctx.relationExtension().asScala.foldLeft(query) { (left, extension) => + if (extension.joinRelation() != null) { +withJoinRelation(extension.joinRelation(), left) + } else if (extension.pivotClause() != null) { +withPivot(extension.pivotClause(), left) + } else { +assert(extension.unpivotClause() != null) +withUnpivot(extension.unpivotClause(), left) + } +} } /** - * Join one more [[LogicalPlan]]s to the current logical plan. + * Join one more [[LogicalPlan]] to the current logical plan. */ - private def withJoinRelations(base: LogicalPlan, ctx: RelationContext): LogicalPlan = { -ctx.joinRelation.asScala.foldLeft(base) { (left, join) => Review Comment: The actual code change is very small, just remove this loop and rename a few variables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
dongjoon-hyun commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1321531047 Thank you, @mridulm ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
mridulm commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1321529913 I was on two minds whether to fix this in 3.3 as well ... Yes, 3.3 is affected by it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
LuciferYang commented on code in PR #38567: URL: https://github.com/apache/spark/pull/38567#discussion_r1027625306 ## core/src/main/scala/org/apache/spark/status/KVUtils.scala: ## @@ -80,6 +89,44 @@ private[spark] object KVUtils extends Logging { db } + def createKVStore( + storePath: Option[File], + diskBackend: HybridStoreDiskBackend.Value, + conf: SparkConf): KVStore = { +storePath.map { path => + val dir = diskBackend match { +case LEVELDB => "listing.ldb" +case ROCKSDB => "listing.rdb" + } + + val dbPath = Files.createDirectories(new File(path, dir).toPath()).toFile() + Utils.chmod700(dbPath) + + + val metadata = FsHistoryProviderMetadata( +FsHistoryProvider.CURRENT_LISTING_VERSION, +AppStatusStore.CURRENT_VERSION, +conf.get(History.HISTORY_LOG_DIR)) + + try { +open(dbPath, metadata, conf, Some(diskBackend)) + } catch { Review Comment: If unexpected data cleaning doesn't occur in the live UI scenario, I think it is OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #38567: [SPARK-41054][UI][CORE] Support RocksDB as KVStore in live UI
mridulm commented on code in PR #38567: URL: https://github.com/apache/spark/pull/38567#discussion_r1027618122 ## core/src/main/scala/org/apache/spark/status/AppStatusStore.scala: ## @@ -769,7 +772,14 @@ private[spark] object AppStatusStore { def createLiveStore( conf: SparkConf, appStatusSource: Option[AppStatusSource] = None): AppStatusStore = { -val store = new ElementTrackingStore(new InMemoryStore(), conf) +val storePath = conf.get(LIVE_UI_LOCAL_STORE_DIR).map(new File(_)) +// For the disk-based KV store of live UI, let's simply make it ROCKSDB only for now, +// instead of supporting both LevelDB and RocksDB. RocksDB is built based on LevelDB with +// improvements on writes and reads. Furthermore, we can reuse the RocksDBFileManager in +// streaming for replicating the local RocksDB file to DFS. The replication in DFS can be +// used for Spark history server. Review Comment: nit: Can we remove the part about replication for now ? When we add support for it, we can suitably modify the comment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
dongjoon-hyun commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1321516325 Thank you, @gaoyajun02 , @mridulm , @otterc . - Do we need to backport this to branch-3.3? - According to the previous failure description, what happens in branch-3.3 in case of failure? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] liuzqt commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultCollec
liuzqt commented on code in PR #38704: URL: https://github.com/apache/spark/pull/38704#discussion_r1027614714 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest with SharedSparkSession { override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g") - test("collect data with single partition larger than 2GB bytes array limit") { + // SPARK-41193: Ignore this suite because it cannot run successfully with Spark + // default Java Options, if user need do local test, please make the following changes: + // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in `sql/core/pom.xml` to `-Xmx10g` + // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` to `-Xmx10g` + ignore("collect data with single partition larger than 2GB bytes array limit") { Review Comment: I think we can leave it as ignore for now with the comments about using larger mem to make it work. I'm not sure if we're able to configure the build args for a specific test suite. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
mridulm commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1321511612 Merged to master. Thanks for fixing this @gaoyajun02 ! Thanks for the review @otterc :-) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] asfgit closed pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size
asfgit closed pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size URL: https://github.com/apache/spark/pull/38333 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on pull request #38467: [SPARK-40987][CORE] Avoid creating a directory when deleting a block, causing DAGScheduler to not work
mridulm commented on PR #38467: URL: https://github.com/apache/spark/pull/38467#issuecomment-1321508245 Agree with @Ngone51, there are two issues here. a) When we have locked for read/write, we expect it to be unlocked and exceptions to be handled gracefully. In this case, `removeBlockInternal` should ensure the lock is unlocked gracefully. Instead of catching `Exception`, I would suggest to move`removeBlock` into `finally` and everything above it in `removeBlockInternal` into a `try` block. A quick look indicated the other uses of `lockForWriting` should be fine - but perhaps something we should audit in future @Ngone51 ! b) Ensure we do not recreate a directory when exit'ing (it is not limited to `removeBlockInternal` in this PR). In addition to (a), I do believe we should do what is in this PR @Ngone51. Thoughts ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027605394 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: Ah, you are right. My bad. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027594911 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: I was also incorrect. This is actually very clear: https://github.com/databricks/scala-style-guide#indent > For method and class constructor invocations, use 2 space indentation for its parameters and put each in each line when the parameters don't fit in two lines. Technically, this is method "invocation", not method "definition". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan closed pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching
cloud-fan closed pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching URL: https://github.com/apache/spark/pull/38692 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #38692: [SPARK-41183][SQL] Add an extension API to do plan normalization for caching
cloud-fan commented on PR #38692: URL: https://github.com/apache/spark/pull/38692#issuecomment-1321487076 thanks for the review, merging to master! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #38731: [SPARK-41209][PYSPARK] Improve PySpark type inference in _merge_type method
sadikovi commented on PR #38731: URL: https://github.com/apache/spark/pull/38731#issuecomment-1321483028 @HyukjinKwon I noticed that NullType in PySpark is on the list of atomic types which it is not, in fact, it is mentioned in the type's doc string. However, I tried to remove it but encountered test failures in `pyspark.pandas.tests.indexes.test_base`. I am going to keep it as is but could you comment on why NullType was added to `_atomic_types` list? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
zhengruifeng commented on code in PR #38735: URL: https://github.com/apache/spark/pull/38735#discussion_r1027583933 ## python/pyspark/sql/connect/dataframe.py: ## @@ -115,6 +115,9 @@ def __init__( self._cache: Dict[str, Any] = {} self._session: "RemoteSparkSession" = session +def __repr__(self) -> str: +return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes)) Review Comment: this follows the default behavior of https://github.com/apache/spark/blob/40a9a6ef5b89f0c3d19db4a43b8a73decaa173c3/python/pyspark/sql/dataframe.py#L860-L869 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #38735: [SPARK-41213][CONNECT][PYTHON] Implement `DataFrame.__repr__` and `DataFrame.dtypes`
zhengruifeng opened a new pull request, #38735: URL: https://github.com/apache/spark/pull/38735 ### What changes were proposed in this pull request? Implement `DataFrame.__repr__` and `DataFrame.dtypes` ### Why are the changes needed? For api coverage ### Does this PR introduce _any_ user-facing change? yes ### How was this patch tested? added UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027573603 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: That's true. We have been unclear about this. However, given that the general principle is to distinguish different sections for readability and we already use `2-space indentation` for **the method body**, I believe what we need is to extend the existing rule by removing `when the parameters don't fit in two lines`. Mixing some part of the test case name and method body doesn't give us much readability. More worse, it's not extensible because that eventually enforces us to use two-space indentation and four-space indentation separately in Case 1 and Case 2. **Case 1** ``` metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + "between analyzed and executed", schema) { (df, _, _) => ``` **Case 2** ``` metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + "long long long long long long long long long long long long long long " + "between analyzed and executed", schema) { (df, _, _) => ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027573603 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: That's true. We have been unclear about this. However, given that the general principle is to distinguish different sections for readability and we already use `2-space indentation` for **the method body**, I believe what we need is to extend the existing rule by removing `when the parameters don't fit in two lines`. Mixing some part of the test case name and method body doesn't give us much readability. More worse, it's not extensible because that eventually leads us to use two-space indentation and four-space indentation in Case 1 and Case 2. **Case 1** ``` metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + "between analyzed and executed", schema) { (df, _, _) => ``` **Case 2** ``` metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + "long long long long long long long long long long long long long long " + "between analyzed and executed", schema) { (df, _, _) => ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027573603 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: That's true. We have been unclear about this. Given that the general principle is to distinguish different sections for readability and we already use `2-space indentation` for *the method body*, I believe what we need is to extend the existing rule by removing `when the parameters don't fit in two lines`. Mixing some part of the test case name and method body doesn't give us much readability. More worse, it's not extensible because that eventually leads us to use two-space indentation and four-space indentation in Case 1 and Case 2. **Case 1** ``` metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + "between analyzed and executed", schema) { (df, _, _) => ``` **Case 2** ``` metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + "long long long long long long long long long long long long long long " + "between analyzed and executed", schema) { (df, _, _) => ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1027559033 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala: ## @@ -119,6 +126,12 @@ class ExecutorPodsAllocator( } } snapshotsStore.addSubscriber(podAllocationDelay) { + maxNumExecutorFailuresOpt.foreach { maxNumExecutorFailures => +if (failureTracker.numFailedExecutors > maxNumExecutorFailures) { + logError(s"Max number of executor failures ($maxNumExecutorFailures) reached") + SparkContext.getActive.foreach(_.stop(EXIT_MAX_EXECUTOR_FAILURES)) Review Comment: It's ugly, but `AbstractPodsAllocator` is a developer api, pass `sc` will change the constructor signature. Suggestions are welcome. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #38734: [SPARK-41212][CONNECT][PYTHON] Implement `DataFrame.isEmpty`
zhengruifeng opened a new pull request, #38734: URL: https://github.com/apache/spark/pull/38734 ### What changes were proposed in this pull request? Implement `DataFrame.isEmpty` ### Why are the changes needed? API Coverage ### Does this PR introduce _any_ user-facing change? Yes, new api ### How was this patch tested? added UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027545869 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: https://github.com/databricks/scala-style-guide#indent For method declarations, use 4 space indentation for their parameters and put each in each line **when the parameters don't fit in two lines**. Return types can be either on the same line as the last parameter, or start a new line with 2 space indent. So the rule of 4 space indentation is for 3+ lines of definition of method. We don't specifically mention about two lines of definition of method. While it's not super clear since we don't have strict guideline, but there is a general rule of spacing `Use 2-space indentation in general.`, which I think it could apply to cases which don't fall to exceptions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027524453 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: I'm curious why it's not a parameter? For me, the second line is a parameter because is **a part of the first parameter**. And, Apache Spark usually splits parameter definition sections from method definition sections, doesn't it? > The second line is not a parameter but a continuation of test name string. BTW, one thing I agree with @HeartSaVioR that we respect the nearest style in the code in general. So, I want to ask @Yaohua628 and @HeartSaVioR explicitly. Do you want to make this as an Apache Spark coding style officially? What I'm asking is that `the indentation on test case name splitting`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027524453 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: I'm curious why it's not a parameter? For me, the second line is a parameter because is **a part of the first parameter**. And, Apache Spark usually splits parameter definition sections from method definition sections, doesn't it? > The second line is not a parameter but a continuation of test name string. BTW, one thing I agree with @HeartSaVioR that we respect the nearest style in the code in general. So, I want to ask @Yaohua628 and @HeartSaVioR explicitly. Do you want to make this as an Apache Spark coding style officially? What I asking is that `the indentation on test case name splitting`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1027519725 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -723,6 +723,25 @@ private[spark] object Config extends Logging { .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer") .createWithDefault(Int.MaxValue) + val KUBERNETES_MAX_EXECUTOR_FAILURES = +ConfigBuilder("spark.kubernetes.executor.maxNumFailures") Review Comment: The equivalent yarn configuration is `spark.yarn.max.executor.failures`, naming suggestion is welcome -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 commented on a diff in pull request #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism
pan3793 commented on code in PR #38732: URL: https://github.com/apache/spark/pull/38732#discussion_r1027519725 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala: ## @@ -723,6 +723,25 @@ private[spark] object Config extends Logging { .checkValue(value => value > 0, "Maximum number of pending pods should be a positive integer") .createWithDefault(Int.MaxValue) + val KUBERNETES_MAX_EXECUTOR_FAILURES = +ConfigBuilder("spark.kubernetes.executor.maxNumFailures") Review Comment: The equivalent yarn configuration is `spark.yarn.max.executor.failures` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] yaooqinn opened a new pull request, #38733: [SPARK-41211][Core] Upgrade ZooKeeper from 3.6.2 to 3.6.3
yaooqinn opened a new pull request, #38733: URL: https://github.com/apache/spark/pull/38733 ### What changes were proposed in this pull request? Upgrade ZooKeeper to 3.6.3 ### Why are the changes needed? ZooKeeper 3.6.3 contains many bugfixes, such as a thread leak issue described in ZOOKEEPER-3706. FYI, https://zookeeper.apache.org/doc/r3.6.3/releasenotes.html - Why 3.6.3? - https://github.com/apache/spark/pull/37507 - https://github.com/apache/spark/pull/32572 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests and dependency check -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on PR #38686: URL: https://github.com/apache/spark/pull/38686#issuecomment-1321403447 I will update this PR after the arrow-based collect is fixed https://github.com/apache/spark/pull/38706 , otherwise, some e2e tests will fail -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027509435 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -600,7 +600,7 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { val df2 = spark.read.format("json") .load(dir.getCanonicalPath + "/target/new-streaming-data-join") // Verify self-join results - assert(streamQuery2.lastProgress.numInputRows == 4L) + assert(streamQuery2.lastProgress.numInputRows == 2L) Review Comment: Off-topic: this is very interesting. Looks like fixing this "enables" ReusedExchange, which somehow makes ProgressReporter pick up the metric from the single leaf node instead of two. > Before the fix ``` == Parsed Logical Plan == WriteToMicroBatchDataSourceV1 FileSink[/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join], 77baa2ac-cc0b-4e01-94ff-ec20c98eb29b, [checkpointLocation=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/checkpoint_join, path=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join], Append, 0 +- Project [name#2339, age#2340, info#2341, _metadata#2345] +- Join Inner, name#2339 = name#2504) AND (age#2340 = age#2505)) AND (info#2341 = info#2506)) AND (_metadata#2345 = _metadata#2507)) :- Project [name#2339, age#2340, info#2341, _metadata#2345] : +- Project [_metadata#2345, name#2339, age#2340, info#2341, _metadata#2345] : +- Project [name#2517 AS name#2339, age#2518 AS age#2340, info#2519 AS info#2341, _metadata#2529 AS _metadata#2345] :+- Relation [name#2517,age#2518,info#2519,_metadata#2529] json +- Project [name#2504, age#2505, info#2506, _metadata#2507] +- Project [_metadata#2507, name#2504, age#2505, info#2506, _metadata#2507] +- Project [name#2523 AS name#2504, age#2524 AS age#2505, info#2525 AS info#2506, _metadata#2530 AS _metadata#2507] +- Relation [name#2523,age#2524,info#2525,_metadata#2530] json == Analyzed Logical Plan == name: string, age: int, info: struct, _metadata: struct WriteToMicroBatchDataSourceV1 FileSink[/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join], 77baa2ac-cc0b-4e01-94ff-ec20c98eb29b, [checkpointLocation=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/checkpoint_join, path=/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/new-streaming-data-join], Append, 0 +- Project [name#2339, age#2340, info#2341, _metadata#2345] +- Join Inner, name#2339 = name#2504) AND (age#2340 = age#2505)) AND (info#2341 = info#2506)) AND (_metadata#2345 = _metadata#2507)) :- Project [name#2339, age#2340, info#2341, _metadata#2345] : +- Project [_metadata#2345, name#2339, age#2340, info#2341, _metadata#2345] : +- Project [name#2517 AS name#2339, age#2518 AS age#2340, info#2519 AS info#2341, _metadata#2529 AS _metadata#2345] :+- Relation [name#2517,age#2518,info#2519,_metadata#2529] json +- Project [name#2504, age#2505, info#2506, _metadata#2507] +- Project [_metadata#2507, name#2504, age#2505, info#2506, _metadata#2507] +- Project [name#2523 AS name#2504, age#2524 AS age#2505, info#2525 AS info#2506, _metadata#2530 AS _metadata#2507] +- Relation [name#2523,age#2524,info#2525,_metadata#2530] json == Optimized Logical Plan == Project [name#2517, age#2518, info#2519, _metadata#2529] +- Join Inner, name#2517 = name#2523) AND (age#2518 = age#2524)) AND (info#2519 = info#2525)) AND (_metadata#2529 = _metadata#2530)) :- Filter (((isnotnull(name#2517) AND isnotnull(age#2518)) AND isnotnull(info#2519)) AND isnotnull(_metadata#2529)) : +- Relation [name#2517,age#2518,info#2519,_metadata#2529] json +- Filter (((isnotnull(name#2523) AND isnotnull(age#2524)) AND isnotnull(info#2525)) AND isnotnull(_metadata#2530)) +- Relation [name#2523,age#2524,info#2525,_metadata#2530] json == Physical Plan == *(3) Project [name#2517, age#2518, info#2519, _metadata#2529] +- StreamingSymmetricHashJoin [name#2517, age#2518, info#2519, _metadata#2529], [name#2523, age#2524, info#2525, _metadata#2530], Inner, condition = [ leftOnly = null, rightOnly = null, both = null, full = null ], state info [ checkpoint = file:/private/var/folders/r0/34w92ww91n3_5htjqqx4_lrhgp/T/spark-3b56e426-a39c-4668-a51f-19bce04c2dd8/target/checkpoint_join/state, runId
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on code in PR #38686: URL: https://github.com/apache/spark/pull/38686#discussion_r1027508681 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -203,6 +204,19 @@ message Sort { } } + +// Drop specified columns. +message Drop { + // (Required) The input relation. + Relation input = 1; + + // (Required) columns to drop. + // + // Should contain at least 1 item. + repeated Expression cols = 2; Review Comment: here follows the naming in `def drop(col: Column, cols: Column*): DataFrame` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pan3793 opened a new pull request, #38732: [SPARK-41210][K8S] Window based executor failure tracking mechanism
pan3793 opened a new pull request, #38732: URL: https://github.com/apache/spark/pull/38732 ### What changes were proposed in this pull request? Fail Spark Application when executor failures reach threshold. ### Why are the changes needed? Sometimes, executor can not launch successful because of wrong configuration, but in K8s, Driver does not know that, just keep requesting new executor. This adds functionality similar to YARN[1][2] to K8s. [1] [SPARK-7451](https://issues.apache.org/jira/browse/SPARK-7451) [2] [SPARK-6735](https://issues.apache.org/jira/browse/SPARK-6735) ### Does this PR introduce _any_ user-facing change? New feature. ### How was this patch tested? UT will be added soon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on code in PR #38686: URL: https://github.com/apache/spark/pull/38686#discussion_r1027507821 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -523,6 +524,19 @@ class SparkConnectPlanner(session: SparkSession) { sameOrderExpressions = Seq.empty) } + private def transformDrop(rel: proto.Drop): LogicalPlan = { +assert(rel.getColsCount > 0, s"cols must contains at least 1 item!") + +val cols = rel.getColsList.asScala.toArray.map { expr => + Column(transformExpression(expr)) Review Comment: Do you mean verify for the arrow-based collect? since we will remove the json code path, it always fails if there are unsupported types. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on code in PR #38686: URL: https://github.com/apache/spark/pull/38686#discussion_r1027506855 ## python/pyspark/sql/connect/dataframe.py: ## @@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame": ) def drop(self, *cols: "ColumnOrString") -> "DataFrame": Review Comment: IIUC, there will be two RPC if we implement it on the client side 1, `all_cols = self.columns` to fetch the schema; 2, build the plan with a dedicated proto mesage, we only need one RPC. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on code in PR #38686: URL: https://github.com/apache/spark/pull/38686#discussion_r1027505888 ## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest { comparePlans(connectPlan2, sparkPlan2) } + test("SPARK-41169: Test drop") { +// single column +val connectPlan = connectTestRelation.drop("id") +val sparkPlan = sparkTestRelation.drop("id") +comparePlans(connectPlan, sparkPlan) + +// all columns +val connectPlan2 = connectTestRelation.drop("id", "name") +val sparkPlan2 = sparkTestRelation.drop("id", "name") +comparePlans(connectPlan2, sparkPlan2) + +// non-existing column Review Comment: will add -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
zhengruifeng commented on code in PR #38686: URL: https://github.com/apache/spark/pull/38686#discussion_r1027505120 ## connector/connect/src/main/protobuf/spark/connect/relations.proto: ## @@ -203,6 +204,19 @@ message Sort { } } + +// Drop specified columns. +message Drop { + // (Required) The input relation. + Relation input = 1; + + // (Required) columns to drop. + // + // Should contain at least 1 item. + repeated Expression cols = 2; Review Comment: Dataset.drop takes arbitrary expressions into account https://github.com/apache/spark/blob/3b4faaf89ecd1b70d3d9b3c6096e1b70275670fb/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2952-L2957 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
HeartSaVioR commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027504882 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: Looks like the indentation is consistent at least in the test suite. Please check other test cases using metadataColumnsTest - they are all 2 spaces. The second line is not a parameter but a continuation of test name string. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wankunde commented on pull request #37922: [SPARK-40480][SHUFFLE] Remove push-based shuffle data after query finished
wankunde commented on PR #37922: URL: https://github.com/apache/spark/pull/37922#issuecomment-1321392369 Hi, @mridulm , I've been working on some other issues recently. If @yabola can do all or part of this task in https://github.com/apache/spark/pull/38560, please go ahead. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027504573 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala: ## @@ -275,8 +275,13 @@ object FileSourceStrategy extends Strategy with PredicateHelper with Logging { .get.withName(FileFormat.ROW_INDEX) } } +// SPARK-41151: metadata column is not nullable for file sources, +// [[CreateNamedStruct]] is also not nullable. Review Comment: One more thing: This is misleading because this is not true in general. > [[CreateNamedStruct]] is also not nullable. To be clear in this context, it would be better to mention `CreateStruct(structColumns)` explicitly instead of saying `[[CreateNamedStruct]] is also not nullable`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027503203 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala: ## @@ -464,11 +464,13 @@ object FileSourceMetadataAttribute { val FILE_SOURCE_METADATA_COL_ATTR_KEY = "__file_source_metadata_col" - def apply(name: String, dataType: DataType, nullable: Boolean = true): AttributeReference = -AttributeReference(name, dataType, nullable, + def apply(name: String, dataType: DataType): AttributeReference = { +// Metadata column for file sources is always not nullable. +AttributeReference(name, dataType, nullable = false, new MetadataBuilder() .putBoolean(METADATA_COL_ATTR_KEY, value = true) .putBoolean(FILE_SOURCE_METADATA_COL_ATTR_KEY, value = true).build())() + } Review Comment: Do we need to add this `{}` additionally? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38683: [SPARK-41151][SQL] Keep built-in file `_metadata` column nullable value consistent
dongjoon-hyun commented on code in PR #38683: URL: https://github.com/apache/spark/pull/38683#discussion_r1027500899 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileMetadataStructSuite.scala: ## @@ -654,4 +654,19 @@ class FileMetadataStructSuite extends QueryTest with SharedSparkSession { } } } + + metadataColumnsTest("SPARK-41151: consistent _metadata nullability " + +"between analyzed and executed", schema) { (df, _, _) => Review Comment: Indentation? We need two more spaces. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi commented on pull request #38731: [SPARK-41209] Improve PySpark type inference in _merge_type method
sadikovi commented on PR #38731: URL: https://github.com/apache/spark/pull/38731#issuecomment-1321374183 @HyukjinKwon @xinrong-meng Could you review this PR? Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadikovi opened a new pull request, #38731: [SPARK-41209] Improve PySpark type inference in _merge_type method
sadikovi opened a new pull request, #38731: URL: https://github.com/apache/spark/pull/38731 ### What changes were proposed in this pull request? This PR updates `_merge_type` method to allow upcast from any `AtomicType` to `StringType` similar to Cast.scala (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L297). This allows us to avoid errors in the case when it is okay to infer type, for example: ```python self.spark.createDataFrame([[1.33, 1], ["2.1", 1]]) ``` However, there is an interesting side-effect that I was debating while working on this PR: ```python spark.createDataFrame([({1: True}, 1), ({"2": False}, 3)]) # Result: # {"1": true} 1 # {"2": false} 3 ``` It seems to be okay to merge map keys with different types but I would like to call it out explicitly. I also removed `NullType` from the list of atomic type because it is neither atomic nor complex type. ### Why are the changes needed? This makes the behaviour between PySpark and Arrow execution more consistent. For example, arrow can handle type upcasts while PySpark cannot. ### Does this PR introduce _any_ user-facing change? Yes, users may notice that previously examples that would fail during inference work now which could constitute a change in behaviour between Spark 3.3 and Spark 3.4. ### How was this patch tested? I updated the existing unit tests and added a couple of new ones to check that we can upcast to StringType. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun opened a new pull request, #38730: [SPARK-41181][SQL] Migrate the map options errors onto error classes
panbingkun opened a new pull request, #38730: URL: https://github.com/apache/spark/pull/38730 ### What changes were proposed in this pull request? The pr aims to migrate the map options errors onto error classes. ### Why are the changes needed? The changes improve the error framework. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mcdull-zhang commented on pull request #38703: [SPARK-41191] [SQL] Cache Table is not working while nested caches exist
mcdull-zhang commented on PR #38703: URL: https://github.com/apache/spark/pull/38703#issuecomment-1321371779 ping @cloud-fan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng opened a new pull request, #38729: [CONNECT][INFRA] Update protobuf versions in CI
zhengruifeng opened a new pull request, #38729: URL: https://github.com/apache/spark/pull/38729 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a diff in pull request #38704: [SPARK-41193][SQL][TESTS] Ignore `collect data with single partition larger than 2GB bytes array limit` in `DatasetLargeResultC
LuciferYang commented on code in PR #38704: URL: https://github.com/apache/spark/pull/38704#discussion_r1027465227 ## sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala: ## @@ -2251,7 +2251,11 @@ class DatasetLargeResultCollectingSuite extends QueryTest with SharedSparkSession { override protected def sparkConf: SparkConf = super.sparkConf.set(MAX_RESULT_SIZE.key, "4g") - test("collect data with single partition larger than 2GB bytes array limit") { + // SPARK-41193: Ignore this suite because it cannot run successfully with Spark + // default Java Options, if user need do local test, please make the following changes: + // - Maven test: change `-Xmx4g` of `scalatest-maven-plugin` in `sql/core/pom.xml` to `-Xmx10g` + // - SBT test: change `-Xmx4g` of `Test / javaOptions` in `SparkBuild.scala` to `-Xmx10g` + ignore("collect data with single partition larger than 2GB bytes array limit") { Review Comment: So how do we move forward? This is a blocking for developers -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python
zhengruifeng commented on PR #38718: URL: https://github.com/apache/spark/pull/38718#issuecomment-1321369830 it seems that the versions in `build_and_test` are not updated? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #38644: [SPARK-41130][SQL] Rename `OUT_OF_DECIMAL_TYPE_RANGE` to `NUMERIC_OUT_OF_SUPPORTED_RANGE`
itholic commented on code in PR #38644: URL: https://github.com/apache/spark/pull/38644#discussion_r1027410164 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastWithAnsiOnSuite.scala: ## @@ -244,7 +244,7 @@ class CastWithAnsiOnSuite extends CastSuiteBase with QueryErrorsBase { Decimal("12345678901234567890123456789012345678")) checkExceptionInExpression[ArithmeticException]( cast("123456789012345678901234567890123456789", DecimalType(38, 0)), - "Out of decimal type range") + "NUMERIC_OUT_OF_SUPPORTED_RANGE") Review Comment: Yes, just created JIRA here: https://issues.apache.org/jira/browse/SPARK-41208 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions
itholic commented on code in PR #38728: URL: https://github.com/apache/spark/pull/38728#discussion_r1027408511 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -45,12 +46,18 @@ import org.apache.spark.util.Utils final case class InvalidPlanInput( private val message: String = "", private val cause: Throwable = None.orNull) -extends Exception(message, cause) +extends SparkException( + errorClass = "CONNECT.INVALID_PLAN_INPUT", + messageParameters = Map("msg" -> message), + cause = cause) final case class InvalidCommandInput( private val message: String = "", private val cause: Throwable = null) -extends Exception(message, cause) +extends SparkException( Review Comment: ditto -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] itholic commented on a diff in pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions
itholic commented on code in PR #38728: URL: https://github.com/apache/spark/pull/38728#discussion_r1027408367 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -45,12 +46,18 @@ import org.apache.spark.util.Utils final case class InvalidPlanInput( private val message: String = "", private val cause: Throwable = None.orNull) -extends Exception(message, cause) +extends SparkException( Review Comment: Can we have test for the exception ?? If it's already exists, let's convert the existing test to use `checkError` to check error class and parameters generated properly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #38718: [SPARK-41196][CONNECT][FOLLOW-UP] Fix out of sync generated files for Python
zhengruifeng commented on PR #38718: URL: https://github.com/apache/spark/pull/38718#issuecomment-1321307266 late lgtm -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] panbingkun commented on a diff in pull request #38725: [SPARK-41182][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1102
panbingkun commented on code in PR #38725: URL: https://github.com/apache/spark/pull/38725#discussion_r1027386994 ## core/src/main/resources/error/error-classes.json: ## @@ -1326,6 +1326,11 @@ "grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup" ] }, + "UNSUPPORTED_LITERAL_FOR_SOURCE_TYPE" : { +"message" : [ + "Literals of the type are currently not supported for the type." Review Comment: OK, Let me to do it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions
HyukjinKwon commented on code in PR #38728: URL: https://github.com/apache/spark/pull/38728#discussion_r1027385407 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -45,12 +46,18 @@ import org.apache.spark.util.Utils final case class InvalidPlanInput( private val message: String = "", private val cause: Throwable = None.orNull) -extends Exception(message, cause) +extends SparkException( Review Comment: cc @MaxGekk @itholic HELP!! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.
HyukjinKwon closed pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client. URL: https://github.com/apache/spark/pull/38726 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.
HyukjinKwon commented on PR #38726: URL: https://github.com/apache/spark/pull/38726#issuecomment-1321296180 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #36070: [SPARK-31675][CORE] Fix rename and delete files with different filesystem
github-actions[bot] closed pull request #36070: [SPARK-31675][CORE] Fix rename and delete files with different filesystem URL: https://github.com/apache/spark/pull/36070 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] commented on pull request #36443: [POC][WIP][SPARK-39088][CORE] Add a "live" driver link to the UI for history server when serving in-progress applications.
github-actions[bot] commented on PR #36443: URL: https://github.com/apache/spark/pull/36443#issuecomment-1321290211 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] github-actions[bot] closed pull request #36908: [SPARK-39510][SQL][WIP] Leverage the natural partitioning and ordering of MonotonicallyIncreasingID
github-actions[bot] closed pull request #36908: [SPARK-39510][SQL][WIP] Leverage the natural partitioning and ordering of MonotonicallyIncreasingID URL: https://github.com/apache/spark/pull/36908 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #38687: [SPARK-41154][SQL] Incorrect relation caching for queries with time travel spec
huaxingao commented on PR #38687: URL: https://github.com/apache/spark/pull/38687#issuecomment-1321287120 LGTM. Thanks for fixing this! @ulysses-you -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HuwCampbell commented on a diff in pull request #36441: [SPARK-39091][SQL] Updating specific SQL Expression traits that don't compose when multiple are extended due to nodePatterns be
HuwCampbell commented on code in PR #36441: URL: https://github.com/apache/spark/pull/36441#discussion_r973814236 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala: ## @@ -839,6 +841,52 @@ class HigherOrderFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper Seq(1d, 2d, Double.NaN, null)) } + test("SPARK-39081: compatibility of combo of HigherOrderFunctions" + +" with other Expression subclasses") { +// Dummy example given in JIRA, this is to test a compile time issue only, has no real usage +case class MyExploder( + arrays: Expression,// Array[AnyDataType] + asOfDate: Expression, // LambdaFunction[AnyDataType -> TimestampType] + extractor: Expression) // TimestampType + extends HigherOrderFunction with Generator with TimeZoneAwareExpression { + + override def children: Seq[Expression] = Seq[Expression](arrays, asOfDate) + + override def withNewChildrenInternal(newChildren: IndexedSeq[Expression]): Expression = null + + override def eval(input: InternalRow): TraversableOnce[InternalRow] = null + + override def nodePatternsInternal(): Seq[TreePattern] = Seq() + + override def elementSchema: StructType = null + + override def arguments: Seq[Expression] = +Seq(arrays, asOfDate) + + override def argumentTypes: Seq[AbstractDataType] = +Seq(ArrayType, TimestampType) + + override def doGenCode(ctx: CodegenContext, exp: ExprCode): ExprCode = null + + override def functions: Seq[Expression] = +Seq(extractor) + + override def functionTypes: Seq[TimestampType] = +Seq(TimestampType) + + override def bind(f: (Expression, +Seq[(DataType, Boolean)]) => LambdaFunction): HigherOrderFunction = null + + override def timeZoneId: Option[String] = None + + override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression = null +} + +/* Should not get the error - value nodePatterns + ... cannot override final member, or conflict in nodePatterns between two types Review Comment: Yeah I think you're pretty much dead on here, the correctness of the optimiser will be up to the user to enforce; as they will need to ensure that all node patterns are accounted for. But, and it's a big one, it makes it possible to compose these interfaces, even with the extra challenges... which it's just _not_ possible to do as of Spark 3.2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38714: [WIP][SPARK-41141]. avoid introducing a new aggregate expression in the analysis phase when subquery is referencing it
AmplabJenkins commented on PR #38714: URL: https://github.com/apache/spark/pull/38714#issuecomment-1321252021 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR closed pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source URL: https://github.com/apache/spark/pull/38717 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR commented on PR #38717: URL: https://github.com/apache/spark/pull/38717#issuecomment-1321250486 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR commented on PR #38717: URL: https://github.com/apache/spark/pull/38717#issuecomment-1321250199 Thanks for understanding. Let's go with no risk fix for now, and have more time to think about the holistic fix. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
srielau commented on PR #38685: URL: https://github.com/apache/spark/pull/38685#issuecomment-1321215985 > > But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS? > > How about AS T(c1, c1) > > @srielau I assumed that we will provide a query context which should point out to the problematic part. Sure, but will every tool look at it? By that token we don't need most of the payload for various errors. Either way, that is not the main point. The main question is whether we should have a distinct error messages for duplicate identifier in "constructors". We apparently allow duplicate attribute names in structs (?).. Should this say: MAP_KEY_ALREADY_EXISTS spark-sql> select map('a', 5, 'a', 6); Duplicate map key a was found I just checked CTE and table alias. Neither enforce unique names, which is curious. So I suppose the question boils down to CREATE TABLE and CREATE VIEW. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
MaxGekk commented on PR #38685: URL: https://github.com/apache/spark/pull/38685#issuecomment-1321196320 > But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS? > How about AS T(c1, c1) @srielau I assumed that we will provide a query context which should point out to the problematic part. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srielau commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
srielau commented on PR #38685: URL: https://github.com/apache/spark/pull/38685#issuecomment-1321188144 I've some doubts about COLUMN_ALREADY_EXISTS when a column is duplicated within a new list. I.e. it makes a lot of sense for ALTER TABLE ADD COLUMN. But is COLUMN_ALREADY_EXISTS the best choice for CREATE TABLE or WITH cte(c1, c1) AS? How about AS T(c1, c1) I think we would like DUPLICATE_COLUMN_NAME for these case and also include the table name. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38722: [SPARK-41200][CORE] BytesToBytesMap's longArray size can be up to MAX_CAPACITY
AmplabJenkins commented on PR #38722: URL: https://github.com/apache/spark/pull/38722#issuecomment-1321183990 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
AmplabJenkins commented on PR #38723: URL: https://github.com/apache/spark/pull/38723#issuecomment-1321183964 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38725: [SPARK-41182][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1102
AmplabJenkins commented on PR #38725: URL: https://github.com/apache/spark/pull/38725#issuecomment-1321183950 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.
AmplabJenkins commented on PR #38726: URL: https://github.com/apache/spark/pull/38726#issuecomment-1321183941 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a diff in pull request #38725: [SPARK-41182][SQL] Assign a name to the error class _LEGACY_ERROR_TEMP_1102
MaxGekk commented on code in PR #38725: URL: https://github.com/apache/spark/pull/38725#discussion_r1027319785 ## core/src/main/resources/error/error-classes.json: ## @@ -1326,6 +1326,11 @@ "grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup" ] }, + "UNSUPPORTED_LITERAL_FOR_SOURCE_TYPE" : { +"message" : [ + "Literals of the type are currently not supported for the type." Review Comment: Looking at where the error comes from: 1. https://github.com/apache/spark/blob/51e4c2cc55aa01f07b28b1cd807b553f8729075d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala#L159 2. https://github.com/apache/spark/blob/b14da8b1b65d9f00f49fab87f738715089bc43e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L2795 , the error message and error class name confuse slightly. The error is mostly about an extract field. How about error class name: `INVALID_EXTRACT_FIELD` Also cc @srielau -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #38685: [SPARK-41206][SQL] Rename the error class `_LEGACY_ERROR_TEMP_1233` to `COLUMN_ALREADY_EXISTS`
MaxGekk commented on PR #38685: URL: https://github.com/apache/spark/pull/38685#issuecomment-1321178694 @srielau @LuciferYang @panbingkun @itholic @cloud-fan Could you review this PR, please. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions
AmplabJenkins commented on PR #38728: URL: https://github.com/apache/spark/pull/38728#issuecomment-1321155240 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip opened a new pull request, #38728: [SPARK-41204] [CONNECT] Migrate custom exceptions to use Spark exceptions
grundprinzip opened a new pull request, #38728: URL: https://github.com/apache/spark/pull/38728 ### What changes were proposed in this pull request? Migrate existing custom exceptions in Spark Connect to use the proper Spark exceptions. ### Why are the changes needed? Consistency ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
grundprinzip commented on code in PR #38723: URL: https://github.com/apache/spark/pull/38723#discussion_r1027288685 ## python/pyspark/sql/connect/column.py: ## @@ -263,6 +263,22 @@ def __str__(self) -> str: return f"Column({self._unparsed_identifier})" +class SQLExpression(Expression): +"""Returns Expression which contains a string which is a SQL expression +and server side will parse it by Catalyst +""" + +def __init__(self, expr: str) -> None: +super().__init__() Review Comment: should we assert here that `expr` is string and not another expression? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #38723: [SPARK-41201][CONNECT][PYTHON] Implement `DataFrame.SelectExpr` in Python client
grundprinzip commented on code in PR #38723: URL: https://github.com/apache/spark/pull/38723#discussion_r1027288544 ## python/pyspark/sql/connect/dataframe.py: ## @@ -124,6 +125,29 @@ def withPlan(cls, plan: plan.LogicalPlan, session: "RemoteSparkSession") -> "Dat def select(self, *cols: "ExpressionOrString") -> "DataFrame": return DataFrame.withPlan(plan.Project(self._plan, *cols), session=self._session) +def selectExpr(self, *expr: Union[str, List[str]]) -> "DataFrame": +"""Projects a set of SQL expressions and returns a new :class:`DataFrame`. + +This is a variant of :func:`select` that accepts SQL expressions. + +.. versionadded:: 3.4.0 + +Returns +--- +:class:`DataFrame` +A DataFrame with new/old columns transformed by expressions. +""" +sql_expr = [] +if len(expr) == 1 and isinstance(expr[0], list): +expr = expr[0] # type: ignore[assignment] +for element in expr: +if isinstance(element, str): +sql_expr.append(SQLExpression(element)) +else: +sql_expr.extend([SQLExpression(e) for e in element]) + +return DataFrame.withPlan(plan.Project(self._plan, *sql_expr), session=self._session) Review Comment: This was not necessarily the question that I had, but I was not remembering correctly the type interface to Project: ``` def __init__(self, child: Optional["LogicalPlan"], *columns: "ExpressionOrString") -> None: ``` In this case `SQLExpression` is an expression and it just works. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] grundprinzip commented on a diff in pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.
grundprinzip commented on code in PR #38726: URL: https://github.com/apache/spark/pull/38726#discussion_r1027276398 ## python/pyspark/sql/connect/dataframe.py: ## @@ -756,6 +757,62 @@ def schema(self) -> StructType: else: return self._schema +def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame": +"""Returns a new :class:`DataFrame`. Concise syntax for chaining custom transformations. + +.. versionadded:: 3.0.0 Review Comment: fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR commented on code in PR #38717: URL: https://github.com/apache/spark/pull/38717#discussion_r1027251716 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -341,7 +355,13 @@ trait ProgressReporter extends Logging { val logicalPlanLeafToSource = newData.flatMap { case (source, logicalPlan) => logicalPlan.collectLeaves().map { leaf => leaf -> source } } - val allLogicalPlanLeaves = lastExecution.logical.collectLeaves() // includes non-streaming Review Comment: I admit I'm not 100% sure of the historical reason to pick up logical plan rather than optimized plan. My guess is to enable comparing the node between "logical node in newData" and "leaf node in logical plan", as optimizer may make a change against leaf node. That said, this approach is a best effort and never be a perfect one. Say, if optimizer deals with self-union into aggregation, the optimized plan will have one less leaf node, which breaks the mechanism. If optimizer swaps subtrees which end up with swapping leaf nodes, it could lead to incorrect metrics. Ideally we should either 1) move all streaming sources to DSv2 or 2) have a dedicated logical and physical node for streaming DSv1 sources, but both don't seem to be easy to achieve. Another possible idea might be assigning some UUID in the node tag for association and retain the tag even optimizer applies rules. If it could propagate the node tag to physical plan, even better. (If that is feasible, we could simply collect the nodes having node tag from executed plan and match with source.) I could probably explore the idea, but it would take time, and also I'm not sure whether the idea is feasible one or not. Do you think the idea makes sense, or it is against how Spark optimization/physical rules work? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon closed pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing
HyukjinKwon closed pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing URL: https://github.com/apache/spark/pull/38708 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing
HyukjinKwon commented on PR #38708: URL: https://github.com/apache/spark/pull/38708#issuecomment-1321075055 Thanks @LuciferYang -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #38708: [SPARK-41194][PROTOBUF][TESTS] Add `log4j2.properties` configuration file for `protobuf` module testing
HyukjinKwon commented on PR #38708: URL: https://github.com/apache/spark/pull/38708#issuecomment-1321075019 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR commented on code in PR #38717: URL: https://github.com/apache/spark/pull/38717#discussion_r1027251960 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -341,7 +355,13 @@ trait ProgressReporter extends Logging { val logicalPlanLeafToSource = newData.flatMap { case (source, logicalPlan) => logicalPlan.collectLeaves().map { leaf => leaf -> source } } - val allLogicalPlanLeaves = lastExecution.logical.collectLeaves() // includes non-streaming Review Comment: For me, chasing the metrics issue in general is "beyond the scope" as of now, although this must be something we should deal with eventually, sooner than later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38726: [SPARK-41203] [CONNECT] Support Dataframe.tansform in Python client.
HyukjinKwon commented on code in PR #38726: URL: https://github.com/apache/spark/pull/38726#discussion_r1027251817 ## python/pyspark/sql/connect/dataframe.py: ## @@ -756,6 +757,62 @@ def schema(self) -> StructType: else: return self._schema +def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame": +"""Returns a new :class:`DataFrame`. Concise syntax for chaining custom transformations. + +.. versionadded:: 3.0.0 + +Parameters +-- +func : function +a function that takes and returns a :class:`DataFrame`. +*args +Positional arguments to pass to func. + +.. versionadded:: 3.3.0 Review Comment: ```suggestion ``` ## python/pyspark/sql/connect/dataframe.py: ## @@ -756,6 +757,62 @@ def schema(self) -> StructType: else: return self._schema +def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame": +"""Returns a new :class:`DataFrame`. Concise syntax for chaining custom transformations. + +.. versionadded:: 3.0.0 + +Parameters +-- +func : function +a function that takes and returns a :class:`DataFrame`. +*args +Positional arguments to pass to func. + +.. versionadded:: 3.3.0 +**kwargs +Keyword arguments to pass to func. + +.. versionadded:: 3.3.0 Review Comment: ```suggestion ``` ## python/pyspark/sql/connect/dataframe.py: ## @@ -756,6 +757,62 @@ def schema(self) -> StructType: else: return self._schema +def transform(self, func: Callable[..., "DataFrame"], *args: Any, **kwargs: Any) -> "DataFrame": +"""Returns a new :class:`DataFrame`. Concise syntax for chaining custom transformations. + +.. versionadded:: 3.0.0 Review Comment: ```suggestion .. versionadded:: 3.3.0 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38717: [SPARK-41198][SS] Fix metrics in streaming query having CTE and DSv1 streaming source
HeartSaVioR commented on code in PR #38717: URL: https://github.com/apache/spark/pull/38717#discussion_r1027251716 ## sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala: ## @@ -341,7 +355,13 @@ trait ProgressReporter extends Logging { val logicalPlanLeafToSource = newData.flatMap { case (source, logicalPlan) => logicalPlan.collectLeaves().map { leaf => leaf -> source } } - val allLogicalPlanLeaves = lastExecution.logical.collectLeaves() // includes non-streaming Review Comment: I admit I'm not 100% sure of the historical reason to pick up logical plan rather than optimized plan. My guess is to enable comparing the node between "logical node in newData" and "leaf node in logical plan", as optimizer may make a change against leaf node. That said, this approach is a best effort and never be a perfect one. Say, if optimizer deals with self-union into aggregation, the optimized plan will have one less leaf node, which breaks the mechanism. Ideally we should either 1) move all streaming sources to DSv2 or 2) have a dedicated logical and physical node for streaming DSv1 sources, but both don't seem to be easy to achieve. Another possible idea might be assigning some UUID in the node tag for association and retain the tag even optimizer applies rules. If it could propagate the node tag to physical plan, even better. (If that is feasible, we could simply collect the nodes having node tag from executed plan and match with source.) I could probably explore the idea, but it would take time, and also I'm not sure whether the idea is feasible one or not. Do you think the idea makes sense, or it is against how Spark optimization/physical rules work? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org