[GitHub] [spark] ivoson commented on pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on PR #39459: URL: https://github.com/apache/spark/pull/39459#issuecomment-1445280811 Hi @mridulm comments addressed. Please take a look when you have time. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118028518 ## core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala: ## @@ -2266,6 +2270,160 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with PrivateMethodTe } } + test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache visibility statue") { +val store = makeBlockManager(8000, "executor1") +val blockId = RDDBlockId(rddId = 1, splitIndex = 1) +var computed: Boolean = false +val data = Seq(1, 2, 3) +val makeIterator = () => { + computed = true + data.iterator +} + +// Cache doesn't exist and is not visible. +assert(store.getStatus(blockId).isEmpty && !store.isRDDBlockVisible(blockId)) +val res1 = store.getOrElseUpdateRDDBlock( + 1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator) +// Put cache successfully and reported block task info. +assert(res1.isLeft && computed) +verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1) + +// Cache exists but not visible. +computed = false +assert(store.getStatus(blockId).nonEmpty && !store.isRDDBlockVisible(blockId)) +val res2 = store.getOrElseUpdateRDDBlock( + 1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator) +// Load cache successfully and reported block task info. +assert(res2.isLeft && computed) +verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1) Review Comment: HI @mridulm actually this is to make sure that task2blockId information will be reported only when the block is invisble as this is part of the work flow design. If this doesn't make sense, I can remove `verify` statement in a new iteration. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118028223 ## core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala: ## @@ -81,6 +81,8 @@ class BlockManagerStorageEndpoint( case ReplicateBlock(blockId, replicas, maxReplicas) => context.reply(blockManager.replicateBlock(blockId, replicas.toSet, maxReplicas)) +case MarkRDDBlockAsVisible(blockId) => Review Comment: Of course. Updated. ## core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala: ## @@ -2266,6 +2270,160 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with PrivateMethodTe } } + test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache visibility statue") { +val store = makeBlockManager(8000, "executor1") +val blockId = RDDBlockId(rddId = 1, splitIndex = 1) +var computed: Boolean = false +val data = Seq(1, 2, 3) +val makeIterator = () => { + computed = true + data.iterator +} + +// Cache doesn't exist and is not visible. +assert(store.getStatus(blockId).isEmpty && !store.isRDDBlockVisible(blockId)) +val res1 = store.getOrElseUpdateRDDBlock( + 1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator) +// Put cache successfully and reported block task info. +assert(res1.isLeft && computed) +verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1) + +// Cache exists but not visible. +computed = false +assert(store.getStatus(blockId).nonEmpty && !store.isRDDBlockVisible(blockId)) Review Comment: Thanks, done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118028179 ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -210,6 +220,65 @@ class BlockManagerMasterEndpoint( case StopBlockManagerMaster => context.reply(true) stop() + +case UpdateRDDBlockTaskInfo(blockId, taskId) => + // This is to report the information that a rdd block(with `blockId`) is computed + // and cached by task(with `taskId`). And this happens right after the task finished + // computing/caching the block only when the block is not visible yet. And the rdd + // block will be marked as visible when the corresponding task finished successfully. + context.reply(updateRDDBlockTaskInfo(blockId, taskId)) + +case GetRDDBlockVisibility(blockId) => + // Get the visibility status of a specific rdd block. + context.reply(isRDDBlockVisible(blockId)) + +case UpdateRDDBlockVisibility(taskId, visible) => + // This is to report the information that whether rdd blocks computed by task(with `taskId`) + // can be turned to be visible. This is reported by DAGScheduler right after task completes. + // If the task finished successfully, rdd blocks can be turned to be visible, otherwise rdd + // blocks' visibility status won't change. + context.reply(updateRDDBlockVisibility(taskId, visible)) + } + + private def isRDDBlockVisible(blockId: RDDBlockId): Boolean = { +if (trackingCacheVisibility) { + blockLocations.containsKey(blockId) && +blockLocations.get(blockId).nonEmpty && !invisibleRDDBlocks.contains(blockId) +} else { + // Blocks should always be visible if the feature flag is disabled. + true +} + } + + private def updateRDDBlockVisibility(taskId: Long, visible: Boolean): Unit = { +if (!trackingCacheVisibility) { + // Do nothing if the feature flag is disabled. + return +} + +if (visible) { + tidToRddBlockIds.get(taskId).foreach { blockIds => +blockIds.foreach { blockId => + invisibleRDDBlocks.remove(blockId) + // Ask block managers to update the visibility status. + val msg = MarkRDDBlockAsVisible(blockId) + getLocations(blockId).flatMap(blockManagerInfo.get).foreach { managerInfo => +managerInfo.storageEndpoint.ask[Unit](msg) + } +} + } +} Review Comment: Updated, and this is the jira: https://issues.apache.org/jira/browse/SPARK-42582 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118028046 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -525,6 +562,7 @@ private[storage] class BlockInfoManager extends Logging { blockInfoWrappers.clear() readLocksByTask.clear() writeLocksByTask.clear() +invisibleRDDBlocks.clear() Review Comment: Thanks, done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118028028 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging { */ private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, BlockInfoWrapper] + /** + * Record invisible rdd blocks stored in the block manager, entries will be removed when blocks + * are marked as visible or blocks are removed by [[removeBlock()]]. + */ + private[spark] val invisibleRDDBlocks = ConcurrentHashMap.newKeySet[RDDBlockId] Review Comment: Make sense. Thanks, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118028001 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -180,6 +186,27 @@ private[storage] class BlockInfoManager extends Logging { // -- + private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = { +if (trackingCacheVisibility) { + invisibleRDDBlocks.synchronized { +blockInfoWrappers.containsKey(blockId) && !invisibleRDDBlocks.contains(blockId) + } +} else { + // Always be visible if the feature flag is disabled. + true +} + } + + private[spark] def tryMarkBlockAsVisible(blockId: RDDBlockId): Unit = { +if (trackingCacheVisibility) { + invisibleRDDBlocks.synchronized { +if (blockInfoWrappers.containsKey(blockId)) { + invisibleRDDBlocks.remove(blockId) Review Comment: Thanks, you are right. Updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118027972 ## core/src/main/scala/org/apache/spark/internal/config/package.scala: ## @@ -2468,4 +2468,15 @@ package object config { .version("3.4.0") .booleanConf .createWithDefault(false) + + private[spark] val RDD_CACHE_VISIBILITY_TRACKING_ENABLED = +ConfigBuilder("spark.rdd.cache.visibilityTracking.enabled") + .internal() + .doc("Set to be true to enabled RDD cache block's visibility status. Once it's enabled," + +" a RDD cache block can be used only when it's marked as visible. And a RDD block will be" + +" marked as visible only when one of the tasks generating the cache block finished" + +" successfully. This is relevant in context of consistent accumulator status.") + .version("3.4.0") + .booleanConf + .createWithDefault(true) Review Comment: Sounds good. Thanks, updated. ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -139,7 +139,7 @@ private[storage] object BlockInfo { * * This class is thread-safe. */ -private[storage] class BlockInfoManager extends Logging { +private[storage] class BlockInfoManager(trackingCacheVisibility: Boolean = true) extends Logging { Review Comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
mridulm commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118023335 ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -728,7 +800,22 @@ class BlockManagerMasterEndpoint( } if (storageLevel.isValid) { + val firstBlock = locations.isEmpty locations.add(blockManagerId) + + blockId.asRDDId.foreach { rddBlockId => +(trackingCacheVisibility, firstBlock) match { + case (true, true) => +// Mark as invisible for the first block. +invisibleRDDBlocks.add(rddBlockId) + case (true, false) if !invisibleRDDBlocks.contains(rddBlockId) => +// If the rdd block is already visible, ask storage manager to update the visibility +// status. +blockManagerInfo(blockManagerId).storageEndpoint + .ask[Unit](MarkRDDBlockAsVisible(rddBlockId)) Review Comment: Ah yes, good point. Makes sense to do this irrespective of the path to block creation. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #40116: [WIP]SPARK-41391 Fix
srowen commented on PR #40116: URL: https://github.com/apache/spark/pull/40116#issuecomment-1445261458 Please fix the PR description too https://spark.apache.org/contributing.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #40116: [WIP]SPARK-41391 Fix
srowen commented on PR #40116: URL: https://github.com/apache/spark/pull/40116#issuecomment-1445261438 This is about SPARK-41391? it also doesn't contain a simple description of what you're reporting, just code snippets. I can work it out, but this could be explained in just a few sentences -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118013777 ## core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala: ## @@ -728,7 +800,22 @@ class BlockManagerMasterEndpoint( } if (storageLevel.isValid) { + val firstBlock = locations.isEmpty locations.add(blockManagerId) + + blockId.asRDDId.foreach { rddBlockId => +(trackingCacheVisibility, firstBlock) match { + case (true, true) => +// Mark as invisible for the first block. +invisibleRDDBlocks.add(rddBlockId) + case (true, false) if !invisibleRDDBlocks.contains(rddBlockId) => +// If the rdd block is already visible, ask storage manager to update the visibility +// status. +blockManagerInfo(blockManagerId).storageEndpoint + .ask[Unit](MarkRDDBlockAsVisible(rddBlockId)) Review Comment: > So ensure that the replica is also marked as visible, right ? Yes, this is one scenario. Another scenario is that once a task failed fetching the cached block from remote executor, it'll compute and cache the block again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #40115: [SPARK-42525][SQL] Collapse two adjacent windows with the same partition/order in subquery
wangyum commented on PR #40115: URL: https://github.com/apache/spark/pull/40115#issuecomment-1445259053 Merged to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum closed pull request #40115: [SPARK-42525][SQL] Collapse two adjacent windows with the same partition/order in subquery
wangyum closed pull request #40115: [SPARK-42525][SQL] Collapse two adjacent windows with the same partition/order in subquery URL: https://github.com/apache/spark/pull/40115 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell opened a new pull request, #40175: [SPARK-42580][CONNECT] Scala client add client side typed APIs
hvanhovell opened a new pull request, #40175: URL: https://github.com/apache/spark/pull/40175 ### What changes were proposed in this pull request? This PR adds the client side typed API to the Spark Connect Scala Client. ### Why are the changes needed? We want to reach API parity with the existing APIs. ### Does this PR introduce _any_ user-facing change? Yes, it adds user API. ### How was this patch tested? Added tests to `ClientE2ETestSuite`, and updated existing tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia closed pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename
amaliujia closed pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename URL: https://github.com/apache/spark/pull/40174 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases
wangyum commented on PR #40093: URL: https://github.com/apache/spark/pull/40093#issuecomment-1445252943 TiDB also supports these optimizations: https://user-images.githubusercontent.com/5399861/221389051-80dbc027-fc94-4610-bb72-f7514abf276b.png;> https://user-images.githubusercontent.com/5399861/221389064-8598c7ac-d472-4ec8-a02a-25d96b44e4da.png;> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
hvanhovell closed pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset URL: https://github.com/apache/spark/pull/40173 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
hvanhovell commented on PR #40173: URL: https://github.com/apache/spark/pull/40173#issuecomment-1445252735 Merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
ivoson commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1118005755 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -399,7 +426,14 @@ private[storage] class BlockInfoManager extends Logging { try { val wrapper = new BlockInfoWrapper(newBlockInfo, lock) while (true) { -val previous = blockInfoWrappers.putIfAbsent(blockId, wrapper) +val previous = invisibleRDDBlocks.synchronized { + val res = blockInfoWrappers.putIfAbsent(blockId, wrapper) + if (res == null && trackingCacheVisibility) { +// Added to invisible blocks if it doesn't exist before. +blockId.asRDDId.foreach(invisibleRDDBlocks.add) Review Comment: Currently we syncrhonized all the write operations, and also the read operations which need to check both the variables. For other read operations, since the variabled won't change, I think synchronized block for such read operations may be not necessary. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename
amaliujia commented on PR #40174: URL: https://github.com/apache/spark/pull/40174#issuecomment-1445246195 Preferring https://github.com/apache/spark/pull/40173 over this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
amaliujia commented on code in PR #40173: URL: https://github.com/apache/spark/pull/40173#discussion_r1118005633 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1990,14 +2019,14 @@ class Dataset[T] private[sql] (val sparkSession: SparkSession, private[sql] val viewName: String, replace: Boolean, global: Boolean): Unit = { -val command = session.newCommand { builder => +val command = sparkSession.newCommand { builder => Review Comment: Or we use this PR to fix the build, depending on you. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on a diff in pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
amaliujia commented on code in PR #40173: URL: https://github.com/apache/spark/pull/40173#discussion_r1118005391 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1990,14 +2019,14 @@ class Dataset[T] private[sql] (val sparkSession: SparkSession, private[sql] val viewName: String, replace: Boolean, global: Boolean): Unit = { -val command = session.newCommand { builder => +val command = sparkSession.newCommand { builder => Review Comment: https://github.com/apache/spark/pull/40174 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename
amaliujia commented on PR #40174: URL: https://github.com/apache/spark/pull/40174#issuecomment-1445245652 @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename
amaliujia opened a new pull request, #40174: URL: https://github.com/apache/spark/pull/40174 ### What changes were proposed in this pull request? Fix BUILD caused by https://github.com/apache/spark/pull/40168. ### Why are the changes needed? Fix BUILD ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on a diff in pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
hvanhovell commented on code in PR #40173: URL: https://github.com/apache/spark/pull/40173#discussion_r1118002760 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -1990,14 +2019,14 @@ class Dataset[T] private[sql] (val sparkSession: SparkSession, private[sql] val viewName: String, replace: Boolean, global: Boolean): Unit = { -val command = session.newCommand { builder => +val command = sparkSession.newCommand { builder => Review Comment: Ah yes, I guess I broke the build... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache
mridulm commented on code in PR #39459: URL: https://github.com/apache/spark/pull/39459#discussion_r1117987993 ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -180,6 +186,27 @@ private[storage] class BlockInfoManager extends Logging { // -- + private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = { +if (trackingCacheVisibility) { + invisibleRDDBlocks.synchronized { +blockInfoWrappers.containsKey(blockId) && !invisibleRDDBlocks.contains(blockId) + } +} else { + // Always be visible if the feature flag is disabled. + true +} + } + + private[spark] def tryMarkBlockAsVisible(blockId: RDDBlockId): Unit = { +if (trackingCacheVisibility) { + invisibleRDDBlocks.synchronized { +if (blockInfoWrappers.containsKey(blockId)) { + invisibleRDDBlocks.remove(blockId) Review Comment: We dont need the `containsKey` check ... ## core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala: ## @@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging { */ private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, BlockInfoWrapper] + /** + * Record invisible rdd blocks stored in the block manager, entries will be removed when blocks + * are marked as visible or blocks are removed by [[removeBlock()]]. + */ + private[spark] val invisibleRDDBlocks = ConcurrentHashMap.newKeySet[RDDBlockId] Review Comment: All accesses and updates to `invisibleRDDBlocks` are within the `synchronized` block - make it a regular `HashSet` instead ? Also, let us make it `private[this]`. For the test cases which are directly accessing variable right now, we can add a ``` private[storage] def containsInvisibleRDDBlock(blockId: RDDBlockId): Boolean = { invisibleRDDBlocks.synchronized { invisibleRDDBlocks.contains(blockId) } } ``` ## core/src/main/scala/org/apache/spark/storage/BlockManager.scala: ## @@ -1325,31 +1328,74 @@ private[spark] class BlockManager( blockInfoManager.releaseAllLocksForTask(taskAttemptId) } + /** + * Retrieve the given rdd block if it exists and is visible, otherwise call the provided + * `makeIterator` method to compute the block, persist it, and return its values. + * + * @return either a BlockResult if the block was successfully cached, or an iterator if the block + * could not be cached. + */ + def getOrElseUpdateRDDBlock[T]( + taskId: Long, + blockId: RDDBlockId, + level: StorageLevel, + classTag: ClassTag[T], + makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = { +val isCacheVisible = isRDDBlockVisible(blockId) +val res = getOrElseUpdate(blockId, level, classTag, makeIterator, isCacheVisible) +if (res.isLeft && !isCacheVisible) { + // Block exists but not visible, report taskId -> blockId info to master. + master.updateRDDBlockTaskInfo(blockId, taskId) +} + +res + } + /** * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method * to compute the block, persist it, and return its values. * * @return either a BlockResult if the block was successfully cached, or an iterator if the block * could not be cached. */ - def getOrElseUpdate[T]( + private def getOrElseUpdate[T]( blockId: BlockId, level: StorageLevel, classTag: ClassTag[T], - makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = { -// Attempt to read the block from local or remote storage. If it's present, then we don't need -// to go through the local-get-or-put path. -get[T](blockId)(classTag) match { - case Some(block) => -return Left(block) - case _ => -// Need to compute the block. + makeIterator: () => Iterator[T], + isCacheVisible: Boolean = true): Either[BlockResult, Iterator[T]] = { Review Comment: Remove default value for `isCacheVisible` ## core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala: ## @@ -2266,6 +2270,160 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with PrivateMethodTe } } + test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache visibility statue") { +val store = makeBlockManager(8000, "executor1") +val blockId = RDDBlockId(rddId = 1, splitIndex = 1) +var computed: Boolean = false +val data = Seq(1, 2, 3) +val makeIterator = () => { + computed = true + data.iterator +} + +// Cache doesn't exist and is not visible. +assert(store.getStatus(blockId).isEmpty && !store.isRDDBlockVisible(blockId)) +val res1 = store.getOrElseUpdateRDDBlock( + 1, blockId,
[GitHub] [spark] github-actions[bot] closed pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning
github-actions[bot] closed pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning URL: https://github.com/apache/spark/pull/38464 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ritikam2 commented on pull request #40116: [WIP]SPARK-41391 Fix
ritikam2 commented on PR #40116: URL: https://github.com/apache/spark/pull/40116#issuecomment-1445233793 Sean not sure which issue you were referring to. I updated the why the changes are needed section of the pull request to mirror what Zheng had already put in his pull request. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
amaliujia commented on PR #40173: URL: https://github.com/apache/spark/pull/40173#issuecomment-1445232248 @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset
amaliujia opened a new pull request, #40173: URL: https://github.com/apache/spark/pull/40173 ### What changes were proposed in this pull request? Add `groupBy(col1: String, cols: String*)` to Scala client Dataset API. ### Why are the changes needed? API coverage ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? UT -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ritikam2 commented on pull request #40116: [WIP]SPARK-41391 Fix
ritikam2 commented on PR #40116: URL: https://github.com/apache/spark/pull/40116#issuecomment-1445232031 I have enabled the workflows on the branch. Is there something else that I need to do? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia commented on pull request #40172: [SPARK-42569][CONNECT][FOLLOW-UP] Throw unsupported exceptions for persist
amaliujia commented on PR #40172: URL: https://github.com/apache/spark/pull/40172#issuecomment-1445228075 @hvanhovell -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] amaliujia opened a new pull request, #40172: [SPARK-42569][CONNECT][FOLLOW-UP] Throw unsupported exceptions for persist
amaliujia opened a new pull request, #40172: URL: https://github.com/apache/spark/pull/40172 ### What changes were proposed in this pull request? Follow up https://github.com/apache/spark/pull/40164 to also throw unsupported operation exception for `persist`. Right now we are ok to depends on the `StorageLevel` in core module but in the future that shall be refactored and moved to a common module. ### Why are the changes needed? Better way to indicate a non-supported API. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? N/A -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] joaoleveiga commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning
joaoleveiga commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1445218024 > Also merged to 3.3 Thank you so much! Here I was assuming I would pick up this thread on monday but you delivered it Cheers -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] santosh-d3vpl3x commented on pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df
santosh-d3vpl3x commented on PR #40122: URL: https://github.com/apache/spark/pull/40122#issuecomment-1445208101 > qq, > > > this is a breaking change to this experimental API. > > What's breaking? would be good to keep the PR desc template (https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE) @HyukjinKwon I have changed the description to include the breaking change and some more details. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] aimtsou commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning
aimtsou commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1445189813 Thank you @srowen, really appreciated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite
dongjoon-hyun commented on PR #40169: URL: https://github.com/apache/spark/pull/40169#issuecomment-1445189613 > Merging. FYI - we are planning a refactor of Catalyst soon (post 3.4) and then we will integrate this with Spark's exception and error framework. Thank you for sharing the future plan, @hvanhovell ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source
hvanhovell closed pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source URL: https://github.com/apache/spark/pull/40166 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source
hvanhovell commented on PR #40166: URL: https://github.com/apache/spark/pull/40166#issuecomment-1445173286 Merging. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types
hvanhovell closed pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types URL: https://github.com/apache/spark/pull/40143 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types
hvanhovell commented on PR #40143: URL: https://github.com/apache/spark/pull/40143#issuecomment-1445172271 Merging to master/3.4. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API
hvanhovell closed pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API URL: https://github.com/apache/spark/pull/40164 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API
hvanhovell commented on PR #40164: URL: https://github.com/apache/spark/pull/40164#issuecomment-1445171880 I am merging this one. Can you do persist in a follow-up? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API
hvanhovell commented on code in PR #40164: URL: https://github.com/apache/spark/pull/40164#discussion_r1117957143 ## connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala: ## @@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: SparkSession, private[sql] val plan: new DataFrameWriterV2[T](table, this) } + def unpersist(blocking: Boolean): this.type = { Review Comment: Yeah, we can use the storage enum. We will move the enum to common/util later on. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset
hvanhovell closed pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset URL: https://github.com/apache/spark/pull/40167 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset
hvanhovell commented on PR #40167: URL: https://github.com/apache/spark/pull/40167#issuecomment-1445171335 Merging. Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite
hvanhovell closed pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite URL: https://github.com/apache/spark/pull/40169 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite
hvanhovell commented on PR #40169: URL: https://github.com/apache/spark/pull/40169#issuecomment-1445171047 Merging. FYI - we are planning a refactor of Catalyst soon (post 3.4) and then we will integrate this with Spark's exception and error framework. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell closed pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs
hvanhovell closed pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs URL: https://github.com/apache/spark/pull/40168 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell commented on pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs
hvanhovell commented on PR #40168: URL: https://github.com/apache/spark/pull/40168#issuecomment-1445170286 Merging, thanks for doing this! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] bjornjorgensen commented on pull request #40171: [Tests] Refactor TPCH schema to separate file similar to TPCDS for code reuse
bjornjorgensen commented on PR #40171: URL: https://github.com/apache/spark/pull/40171#issuecomment-1445168808 [priv...@spark.apache.org](mailto:priv...@spark.apache.org) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning
srowen commented on PR #37817: URL: https://github.com/apache/spark/pull/37817#issuecomment-1445157668 Also merged to 3.3 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Kilo59 commented on pull request #29591: [SPARK-32714][PYTHON] Initial pyspark-stubs port.
Kilo59 commented on PR #29591: URL: https://github.com/apache/spark/pull/29591#issuecomment-1445147815 Has anyone solved the problem of trying to type-check pyspark code without installing the 200+MB pyspark package? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #40116: [WIP]SPARK-41391 Fix
srowen commented on PR #40116: URL: https://github.com/apache/spark/pull/40116#issuecomment-1445132726 Eh, this does not explain the issue at all. Please do so. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client
vicennial commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117931152 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,87 @@ message ExecutePlanResponse { } } +// Request to transfer client-local artifacts. +message AddArtifactsRequest { + + // Definition of an Artifact. + message Artifact { +// The name of the artifact is expected in the form of a "Relative Path" that is made up of a +// sequence of directories and the final file element. +// Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", "abc.xyz", "a/b/X.jar". +// The server is expected to maintain the hierarchy of files as defined by their name. (i.e +// The relative path of the file on the server's filesystem will be the same as the name of +// the provided artifact) +string name = 1; +// Raw data. +bytes data = 2; +// CRC to allow server to verify integrity of the artifact. +int64 crc = 3; Review Comment: Yes, that makes sense but I'm wondering how we should proceed with the naming since the naming `Aritifact` should technically represent any artifact regardless of size. A proposal for the naming: Name the smallest unit of data as `ArtifactChunk` ``` message ArtifactChunk { bytes data = 1; int64 crc = 2; } ``` Use the name `SingleAritifact` (or perhaps `CompactArtifact`) instead of `Artifact` ``` message SingleArtifact { string name = 1; ArtifactChunk data = 2; } ``` Rename `BeginChunkedArtifact` to `ChunkedArtifact` ``` message ChunkedArtifact { string name = 1; int64 total_bytes = 2; int64 num_chunks = 3; ArtifactChunk initial_chunk = 4; } ``` Now we can define an "Artifact" (verbally) as either a {`SingleArtifact`} or {`ChunkedAritfact` + it's appended `ArtifactChunks`}. WDYT? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client
vicennial commented on code in PR #40147: URL: https://github.com/apache/spark/pull/40147#discussion_r1117927851 ## connector/connect/common/src/main/protobuf/spark/connect/base.proto: ## @@ -183,6 +183,60 @@ message ExecutePlanResponse { } } +// Request to transfer client-local artifacts. +message AddArtifactsRequest { + + // Definition of an Artifact. + message Artifact { +// The name of the artifact is expected in the form of a "Relative Path" that is made up of a +// sequence of directories and the final file element. +// Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", "abc.xyz", "a/b/X.jar". +// The server is expected to maintain the hierarchy of files as defined by their name. (i.e +// The relative path of the file on the server's filesystem will be the same as the name of +// the provided artifact) +string name = 1; +// Raw data. +bytes data = 2; + } + + // A number of small artifacts batched into a single RPC. + message Batch { +repeated Artifact artifacts = 1; + } + + // The client_id is set by the client to be able to collate streaming responses from + // different queries. + string client_id = 1; + + // User context + UserContext user_context = 2; + + // The payload is either a batch of artifacts or a partial chunk of a large artifact. + oneof payload { +Batch batch = 3; +// A large artifact chunked into multiple requests. The server side should assume that the +// artifact has been completely uploaded either when it encounters a new artifact name, or +// when the the stream is completed. +Artifact chunk = 4; Review Comment: > Do we also need EndChunkedArtifact? The `BeginChunkedArtifact` contains some metadata but I'm not sure what other data we can include in `EndChunkedArtifact` that's also not available in `BeginChunkedArtifact` so I've left it out in order to keep message types minimal. However, no strong preference for either way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AlanBateman commented on pull request #39909: [SPARK-42369][CORE] Fix constructor for java.nio.DirectByteBuffer
AlanBateman commented on PR #39909: URL: https://github.com/apache/spark/pull/39909#issuecomment-1445042515 It's the same issue for JDK 8 where it was completely unsupported to be make use of the JDK private constructor. For the JDK side, all we can do it strongly courage the Spark project to move away from these hacks. If there are compelling cases for introducing new APIs or interfaces then please bring them to the OpenJDK mailing lists for discussion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold
HeartSaVioR closed pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold URL: https://github.com/apache/spark/pull/40163 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold
HeartSaVioR commented on PR #40163: URL: https://github.com/apache/spark/pull/40163#issuecomment-1445041909 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR closed pull request #40162: [SPARK-42566][SS] RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD
HeartSaVioR closed pull request #40162: [SPARK-42566][SS] RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD URL: https://github.com/apache/spark/pull/40162 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #40162: [SPARK-42566][SS] RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD
HeartSaVioR commented on PR #40162: URL: https://github.com/apache/spark/pull/40162#issuecomment-1445041550 Thanks! Merging to master. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Surbhi-Vijay commented on pull request #40171: [Tests] Refactor TPCH schema to separate file similar to TPCDS for code reuse
Surbhi-Vijay commented on PR #40171: URL: https://github.com/apache/spark/pull/40171#issuecomment-1445041525 Can someone please help in creating issue for this pull request ? I do not have a ASF Jira account. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] Surbhi-Vijay opened a new pull request, #40171: [Tests] Refactor TPCH schema to separate file similar to TPCDS for code reuse
Surbhi-Vijay opened a new pull request, #40171: URL: https://github.com/apache/spark/pull/40171 ### What changes were proposed in this pull request? Changes are only in tests files. - Refactored the `TPCHBase` class and created `TPCHSchema` similar to TPCDS. - Created a base class `TPCSchema` which is extended by `TPCDSSchema` and `TPCHSchema`. ### Why are the changes needed? This PR just refactors the code. Going forward it will help in making code changes at just one place to reflect in both TPCDS and TPCH. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the PlanStabilitySuite to verify the changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org