[GitHub] [spark] ivoson commented on pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on PR #39459:
URL: https://github.com/apache/spark/pull/39459#issuecomment-1445280811

   Hi @mridulm  comments addressed. Please take a look when you have time. 
Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118028518


##
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala:
##
@@ -2266,6 +2270,160 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with PrivateMethodTe
 }
   }
 
+  test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache 
visibility statue") {
+val store = makeBlockManager(8000, "executor1")
+val blockId = RDDBlockId(rddId = 1, splitIndex = 1)
+var computed: Boolean = false
+val data = Seq(1, 2, 3)
+val makeIterator = () => {
+  computed = true
+  data.iterator
+}
+
+// Cache doesn't exist and is not visible.
+assert(store.getStatus(blockId).isEmpty && 
!store.isRDDBlockVisible(blockId))
+val res1 = store.getOrElseUpdateRDDBlock(
+  1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+// Put cache successfully and reported block task info.
+assert(res1.isLeft && computed)
+verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1)
+
+// Cache exists but not visible.
+computed = false
+assert(store.getStatus(blockId).nonEmpty && 
!store.isRDDBlockVisible(blockId))
+val res2 = store.getOrElseUpdateRDDBlock(
+  1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+// Load cache successfully and reported block task info.
+assert(res2.isLeft && computed)
+verify(master, times(2)).updateRDDBlockTaskInfo(blockId, 1)

Review Comment:
   HI @mridulm actually this is to make sure that task2blockId information will 
be reported only when the block is invisble as this is part of the work flow 
design. If this doesn't make sense, I can remove `verify` statement in a new 
iteration. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118028223


##
core/src/main/scala/org/apache/spark/storage/BlockManagerStorageEndpoint.scala:
##
@@ -81,6 +81,8 @@ class BlockManagerStorageEndpoint(
 case ReplicateBlock(blockId, replicas, maxReplicas) =>
   context.reply(blockManager.replicateBlock(blockId, replicas.toSet, 
maxReplicas))
 
+case MarkRDDBlockAsVisible(blockId) =>

Review Comment:
   Of course. Updated.



##
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala:
##
@@ -2266,6 +2270,160 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with PrivateMethodTe
 }
   }
 
+  test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache 
visibility statue") {
+val store = makeBlockManager(8000, "executor1")
+val blockId = RDDBlockId(rddId = 1, splitIndex = 1)
+var computed: Boolean = false
+val data = Seq(1, 2, 3)
+val makeIterator = () => {
+  computed = true
+  data.iterator
+}
+
+// Cache doesn't exist and is not visible.
+assert(store.getStatus(blockId).isEmpty && 
!store.isRDDBlockVisible(blockId))
+val res1 = store.getOrElseUpdateRDDBlock(
+  1, blockId, StorageLevel.MEMORY_ONLY, classTag[Int], makeIterator)
+// Put cache successfully and reported block task info.
+assert(res1.isLeft && computed)
+verify(master, times(1)).updateRDDBlockTaskInfo(blockId, 1)
+
+// Cache exists but not visible.
+computed = false
+assert(store.getStatus(blockId).nonEmpty && 
!store.isRDDBlockVisible(blockId))

Review Comment:
   Thanks, done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118028179


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -210,6 +220,65 @@ class BlockManagerMasterEndpoint(
 case StopBlockManagerMaster =>
   context.reply(true)
   stop()
+
+case UpdateRDDBlockTaskInfo(blockId, taskId) =>
+  // This is to report the information that a rdd block(with `blockId`) is 
computed
+  // and cached by task(with `taskId`). And this happens right after the 
task finished
+  // computing/caching the block only when the block is not visible yet. 
And the rdd
+  // block will be marked as visible when the corresponding task finished 
successfully.
+  context.reply(updateRDDBlockTaskInfo(blockId, taskId))
+
+case GetRDDBlockVisibility(blockId) =>
+  // Get the visibility status of a specific rdd block.
+  context.reply(isRDDBlockVisible(blockId))
+
+case UpdateRDDBlockVisibility(taskId, visible) =>
+  // This is to report the information that whether rdd blocks computed by 
task(with `taskId`)
+  // can be turned to be visible. This is reported by DAGScheduler right 
after task completes.
+  // If the task finished successfully, rdd blocks can be turned to be 
visible, otherwise rdd
+  // blocks' visibility status won't change.
+  context.reply(updateRDDBlockVisibility(taskId, visible))
+  }
+
+  private def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+if (trackingCacheVisibility) {
+  blockLocations.containsKey(blockId) &&
+blockLocations.get(blockId).nonEmpty && 
!invisibleRDDBlocks.contains(blockId)
+} else {
+  // Blocks should always be visible if the feature flag is disabled.
+  true
+}
+  }
+
+  private def updateRDDBlockVisibility(taskId: Long, visible: Boolean): Unit = 
{
+if (!trackingCacheVisibility) {
+  // Do nothing if the feature flag is disabled.
+  return
+}
+
+if (visible) {
+  tidToRddBlockIds.get(taskId).foreach { blockIds =>
+blockIds.foreach { blockId =>
+  invisibleRDDBlocks.remove(blockId)
+  // Ask block managers to update the visibility status.
+  val msg = MarkRDDBlockAsVisible(blockId)
+  getLocations(blockId).flatMap(blockManagerInfo.get).foreach { 
managerInfo =>
+managerInfo.storageEndpoint.ask[Unit](msg)
+  }
+}
+  }
+}

Review Comment:
   Updated, and this is the jira: 
https://issues.apache.org/jira/browse/SPARK-42582



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118028046


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -525,6 +562,7 @@ private[storage] class BlockInfoManager extends Logging {
 blockInfoWrappers.clear()
 readLocksByTask.clear()
 writeLocksByTask.clear()
+invisibleRDDBlocks.clear()

Review Comment:
   Thanks, done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118028028


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging {
*/
   private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, 
BlockInfoWrapper]
 
+  /**
+   * Record invisible rdd blocks stored in the block manager, entries will be 
removed when blocks
+   * are marked as visible or blocks are removed by [[removeBlock()]].
+   */
+  private[spark] val invisibleRDDBlocks = 
ConcurrentHashMap.newKeySet[RDDBlockId]

Review Comment:
   Make sense. Thanks, updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118028001


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -180,6 +186,27 @@ private[storage] class BlockInfoManager extends Logging {
 
   // 
--
 
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+if (trackingCacheVisibility) {
+  invisibleRDDBlocks.synchronized {
+blockInfoWrappers.containsKey(blockId) && 
!invisibleRDDBlocks.contains(blockId)
+  }
+} else {
+  // Always be visible if the feature flag is disabled.
+  true
+}
+  }
+
+  private[spark] def tryMarkBlockAsVisible(blockId: RDDBlockId): Unit = {
+if (trackingCacheVisibility) {
+  invisibleRDDBlocks.synchronized {
+if (blockInfoWrappers.containsKey(blockId)) {
+  invisibleRDDBlocks.remove(blockId)

Review Comment:
   Thanks, you are right. Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118027972


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2468,4 +2468,15 @@ package object config {
   .version("3.4.0")
   .booleanConf
   .createWithDefault(false)
+
+  private[spark] val RDD_CACHE_VISIBILITY_TRACKING_ENABLED =
+ConfigBuilder("spark.rdd.cache.visibilityTracking.enabled")
+  .internal()
+  .doc("Set to be true to enabled RDD cache block's visibility status. 
Once it's enabled," +
+" a RDD cache block can be used only when it's marked as visible. And 
a RDD block will be" +
+" marked as visible only when one of the tasks generating the cache 
block finished" +
+" successfully. This is relevant in context of consistent accumulator 
status.")
+  .version("3.4.0")
+  .booleanConf
+  .createWithDefault(true)

Review Comment:
   Sounds good. Thanks, updated.



##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -139,7 +139,7 @@ private[storage] object BlockInfo {
  *
  * This class is thread-safe.
  */
-private[storage] class BlockInfoManager extends Logging {
+private[storage] class BlockInfoManager(trackingCacheVisibility: Boolean = 
true) extends Logging {

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118023335


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -728,7 +800,22 @@ class BlockManagerMasterEndpoint(
 }
 
 if (storageLevel.isValid) {
+  val firstBlock = locations.isEmpty
   locations.add(blockManagerId)
+
+  blockId.asRDDId.foreach { rddBlockId =>
+(trackingCacheVisibility, firstBlock) match {
+  case (true, true) =>
+// Mark as invisible for the first block.
+invisibleRDDBlocks.add(rddBlockId)
+  case (true, false) if !invisibleRDDBlocks.contains(rddBlockId) =>
+// If the rdd block is already visible, ask storage manager to 
update the visibility
+// status.
+blockManagerInfo(blockManagerId).storageEndpoint
+  .ask[Unit](MarkRDDBlockAsVisible(rddBlockId))

Review Comment:
   Ah yes, good point. Makes sense to do this irrespective of the path to block 
creation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #40116: [WIP]SPARK-41391 Fix

2023-02-25 Thread via GitHub


srowen commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1445261458

   Please fix the PR description too https://spark.apache.org/contributing.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #40116: [WIP]SPARK-41391 Fix

2023-02-25 Thread via GitHub


srowen commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1445261438

   This is about SPARK-41391? it also doesn't contain a simple description of 
what you're reporting, just code snippets. I can work it out, but this could be 
explained in just a few sentences


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118013777


##
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala:
##
@@ -728,7 +800,22 @@ class BlockManagerMasterEndpoint(
 }
 
 if (storageLevel.isValid) {
+  val firstBlock = locations.isEmpty
   locations.add(blockManagerId)
+
+  blockId.asRDDId.foreach { rddBlockId =>
+(trackingCacheVisibility, firstBlock) match {
+  case (true, true) =>
+// Mark as invisible for the first block.
+invisibleRDDBlocks.add(rddBlockId)
+  case (true, false) if !invisibleRDDBlocks.contains(rddBlockId) =>
+// If the rdd block is already visible, ask storage manager to 
update the visibility
+// status.
+blockManagerInfo(blockManagerId).storageEndpoint
+  .ask[Unit](MarkRDDBlockAsVisible(rddBlockId))

Review Comment:
   > So ensure that the replica is also marked as visible, right ?
   
   Yes, this is one scenario. Another scenario is that once a task failed 
fetching the cached block from remote executor, it'll compute and cache the 
block again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #40115: [SPARK-42525][SQL] Collapse two adjacent windows with the same partition/order in subquery

2023-02-25 Thread via GitHub


wangyum commented on PR #40115:
URL: https://github.com/apache/spark/pull/40115#issuecomment-1445259053

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum closed pull request #40115: [SPARK-42525][SQL] Collapse two adjacent windows with the same partition/order in subquery

2023-02-25 Thread via GitHub


wangyum closed pull request #40115: [SPARK-42525][SQL] Collapse two adjacent 
windows with the same partition/order in subquery
URL: https://github.com/apache/spark/pull/40115


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell opened a new pull request, #40175: [SPARK-42580][CONNECT] Scala client add client side typed APIs

2023-02-25 Thread via GitHub


hvanhovell opened a new pull request, #40175:
URL: https://github.com/apache/spark/pull/40175

   ### What changes were proposed in this pull request?
   This PR adds the client side typed API to the Spark Connect Scala Client.
   
   ### Why are the changes needed?
   We want to reach API parity with the existing APIs.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds user API.
   
   ### How was this patch tested?
   Added tests to `ClientE2ETestSuite`, and updated existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia closed pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename

2023-02-25 Thread via GitHub


amaliujia closed pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix 
broken build after variable rename
URL: https://github.com/apache/spark/pull/40174


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] wangyum commented on pull request #40093: [SPARK-42500][SQL] ConstantPropagation supports more cases

2023-02-25 Thread via GitHub


wangyum commented on PR #40093:
URL: https://github.com/apache/spark/pull/40093#issuecomment-1445252943

   TiDB also supports these optimizations:
   https://user-images.githubusercontent.com/5399861/221389051-80dbc027-fc94-4610-bb72-f7514abf276b.png;>
   https://user-images.githubusercontent.com/5399861/221389064-8598c7ac-d472-4ec8-a02a-25d96b44e4da.png;>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy 
method to Dataset
URL: https://github.com/apache/spark/pull/40173


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40173:
URL: https://github.com/apache/spark/pull/40173#issuecomment-1445252735

   Merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ivoson commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


ivoson commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1118005755


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -399,7 +426,14 @@ private[storage] class BlockInfoManager extends Logging {
 try {
   val wrapper = new BlockInfoWrapper(newBlockInfo, lock)
   while (true) {
-val previous = blockInfoWrappers.putIfAbsent(blockId, wrapper)
+val previous = invisibleRDDBlocks.synchronized {
+  val res = blockInfoWrappers.putIfAbsent(blockId, wrapper)
+  if (res == null && trackingCacheVisibility) {
+// Added to invisible blocks if it doesn't exist before.
+blockId.asRDDId.foreach(invisibleRDDBlocks.add)

Review Comment:
   Currently we syncrhonized all the write operations, and also the read 
operations which need to check both the variables. For other read operations, 
since the variabled won't change, I think synchronized block for such read 
operations may be not necessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename

2023-02-25 Thread via GitHub


amaliujia commented on PR #40174:
URL: https://github.com/apache/spark/pull/40174#issuecomment-1445246195

   Preferring https://github.com/apache/spark/pull/40173 over this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


amaliujia commented on code in PR #40173:
URL: https://github.com/apache/spark/pull/40173#discussion_r1118005633


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -1990,14 +2019,14 @@ class Dataset[T] private[sql] (val sparkSession: 
SparkSession, private[sql] val
   viewName: String,
   replace: Boolean,
   global: Boolean): Unit = {
-val command = session.newCommand { builder =>
+val command = sparkSession.newCommand { builder =>

Review Comment:
   Or we use this PR to fix the build, depending on you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on a diff in pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


amaliujia commented on code in PR #40173:
URL: https://github.com/apache/spark/pull/40173#discussion_r1118005391


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -1990,14 +2019,14 @@ class Dataset[T] private[sql] (val sparkSession: 
SparkSession, private[sql] val
   viewName: String,
   replace: Boolean,
   global: Boolean): Unit = {
-val command = session.newCommand { builder =>
+val command = sparkSession.newCommand { builder =>

Review Comment:
   https://github.com/apache/spark/pull/40174



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename

2023-02-25 Thread via GitHub


amaliujia commented on PR #40174:
URL: https://github.com/apache/spark/pull/40174#issuecomment-1445245652

   @hvanhovell 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #40174: [SPARK-42573][CONNECT][FOLLOW-UP] fix broken build after variable rename

2023-02-25 Thread via GitHub


amaliujia opened a new pull request, #40174:
URL: https://github.com/apache/spark/pull/40174

   
   
   ### What changes were proposed in this pull request?
   
   Fix BUILD caused by https://github.com/apache/spark/pull/40168. 
   
   ### Why are the changes needed?
   
   Fix BUILD
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


hvanhovell commented on code in PR #40173:
URL: https://github.com/apache/spark/pull/40173#discussion_r1118002760


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -1990,14 +2019,14 @@ class Dataset[T] private[sql] (val sparkSession: 
SparkSession, private[sql] val
   viewName: String,
   replace: Boolean,
   global: Boolean): Unit = {
-val command = session.newCommand { builder =>
+val command = sparkSession.newCommand { builder =>

Review Comment:
   Ah yes, I guess I broke the build...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

2023-02-25 Thread via GitHub


mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1117987993


##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -180,6 +186,27 @@ private[storage] class BlockInfoManager extends Logging {
 
   // 
--
 
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+if (trackingCacheVisibility) {
+  invisibleRDDBlocks.synchronized {
+blockInfoWrappers.containsKey(blockId) && 
!invisibleRDDBlocks.contains(blockId)
+  }
+} else {
+  // Always be visible if the feature flag is disabled.
+  true
+}
+  }
+
+  private[spark] def tryMarkBlockAsVisible(blockId: RDDBlockId): Unit = {
+if (trackingCacheVisibility) {
+  invisibleRDDBlocks.synchronized {
+if (blockInfoWrappers.containsKey(blockId)) {
+  invisibleRDDBlocks.remove(blockId)

Review Comment:
   We dont need the `containsKey` check ...



##
core/src/main/scala/org/apache/spark/storage/BlockInfoManager.scala:
##
@@ -150,6 +150,12 @@ private[storage] class BlockInfoManager extends Logging {
*/
   private[this] val blockInfoWrappers = new ConcurrentHashMap[BlockId, 
BlockInfoWrapper]
 
+  /**
+   * Record invisible rdd blocks stored in the block manager, entries will be 
removed when blocks
+   * are marked as visible or blocks are removed by [[removeBlock()]].
+   */
+  private[spark] val invisibleRDDBlocks = 
ConcurrentHashMap.newKeySet[RDDBlockId]

Review Comment:
   All accesses and updates to `invisibleRDDBlocks` are within the 
`synchronized` block - make it a regular `HashSet` instead ?
   
   Also, let us make it `private[this]`.
   
   For the test cases which are directly accessing variable right now, we can 
add a
   ```
 private[storage] def containsInvisibleRDDBlock(blockId: RDDBlockId): 
Boolean = {
   invisibleRDDBlocks.synchronized {
 invisibleRDDBlocks.contains(blockId) 
   }
 }
   ```



##
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##
@@ -1325,31 +1328,74 @@ private[spark] class BlockManager(
 blockInfoManager.releaseAllLocksForTask(taskAttemptId)
   }
 
+  /**
+   * Retrieve the given rdd block if it exists and is visible, otherwise call 
the provided
+   * `makeIterator` method to compute the block, persist it, and return its 
values.
+   *
+   * @return either a BlockResult if the block was successfully cached, or an 
iterator if the block
+   * could not be cached.
+   */
+  def getOrElseUpdateRDDBlock[T](
+  taskId: Long,
+  blockId: RDDBlockId,
+  level: StorageLevel,
+  classTag: ClassTag[T],
+  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
+val isCacheVisible = isRDDBlockVisible(blockId)
+val res = getOrElseUpdate(blockId, level, classTag, makeIterator, 
isCacheVisible)
+if (res.isLeft && !isCacheVisible) {
+  // Block exists but not visible, report taskId -> blockId info to master.
+  master.updateRDDBlockTaskInfo(blockId, taskId)
+}
+
+res
+  }
+
   /**
* Retrieve the given block if it exists, otherwise call the provided 
`makeIterator` method
* to compute the block, persist it, and return its values.
*
* @return either a BlockResult if the block was successfully cached, or an 
iterator if the block
* could not be cached.
*/
-  def getOrElseUpdate[T](
+  private def getOrElseUpdate[T](
   blockId: BlockId,
   level: StorageLevel,
   classTag: ClassTag[T],
-  makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
-// Attempt to read the block from local or remote storage. If it's 
present, then we don't need
-// to go through the local-get-or-put path.
-get[T](blockId)(classTag) match {
-  case Some(block) =>
-return Left(block)
-  case _ =>
-// Need to compute the block.
+  makeIterator: () => Iterator[T],
+  isCacheVisible: Boolean = true): Either[BlockResult, Iterator[T]] = {

Review Comment:
   Remove default value for `isCacheVisible`



##
core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala:
##
@@ -2266,6 +2270,160 @@ class BlockManagerSuite extends SparkFunSuite with 
Matchers with PrivateMethodTe
 }
   }
 
+  test("SPARK-41497: getOrElseUpdateRDDBlock do compute based on cache 
visibility statue") {
+val store = makeBlockManager(8000, "executor1")
+val blockId = RDDBlockId(rddId = 1, splitIndex = 1)
+var computed: Boolean = false
+val data = Seq(1, 2, 3)
+val makeIterator = () => {
+  computed = true
+  data.iterator
+}
+
+// Cache doesn't exist and is not visible.
+assert(store.getStatus(blockId).isEmpty && 
!store.isRDDBlockVisible(blockId))
+val res1 = store.getOrElseUpdateRDDBlock(
+  1, blockId, 

[GitHub] [spark] github-actions[bot] closed pull request #38464: [SPARK-32628][SQL] Use bloom filter to improve dynamic partition pruning

2023-02-25 Thread via GitHub


github-actions[bot] closed pull request #38464: [SPARK-32628][SQL] Use bloom 
filter to improve dynamic partition pruning
URL: https://github.com/apache/spark/pull/38464


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ritikam2 commented on pull request #40116: [WIP]SPARK-41391 Fix

2023-02-25 Thread via GitHub


ritikam2 commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1445233793

   Sean not sure which issue you were referring to. I updated the why the 
changes are needed section of the pull request to mirror what Zheng had already 
put in his pull request.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


amaliujia commented on PR #40173:
URL: https://github.com/apache/spark/pull/40173#issuecomment-1445232248

   @hvanhovell 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #40173: [SPARK-42576][CONNECT] Add 2nd groupBy method to Dataset

2023-02-25 Thread via GitHub


amaliujia opened a new pull request, #40173:
URL: https://github.com/apache/spark/pull/40173

   
   
   ### What changes were proposed in this pull request?
   
   Add `groupBy(col1: String, cols: String*)` to Scala client Dataset API.
   
   ### Why are the changes needed?
   
   API coverage
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   NO
   ### How was this patch tested?
   
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ritikam2 commented on pull request #40116: [WIP]SPARK-41391 Fix

2023-02-25 Thread via GitHub


ritikam2 commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1445232031

   I have enabled the workflows on the branch. Is there something else that I 
need to do?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40172: [SPARK-42569][CONNECT][FOLLOW-UP] Throw unsupported exceptions for persist

2023-02-25 Thread via GitHub


amaliujia commented on PR #40172:
URL: https://github.com/apache/spark/pull/40172#issuecomment-1445228075

   @hvanhovell 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia opened a new pull request, #40172: [SPARK-42569][CONNECT][FOLLOW-UP] Throw unsupported exceptions for persist

2023-02-25 Thread via GitHub


amaliujia opened a new pull request, #40172:
URL: https://github.com/apache/spark/pull/40172

   
   
   ### What changes were proposed in this pull request?
   
   Follow up https://github.com/apache/spark/pull/40164 to also throw 
unsupported operation exception for `persist`. Right now we are ok to depends 
on the `StorageLevel` in core module but in the future that shall be refactored 
and moved to a common module.
   
   
   ### Why are the changes needed?
   
   Better way to indicate a non-supported API.
   
   ### Does this PR introduce _any_ user-facing change?
   
   NO
   
   ### How was this patch tested?
   
   N/A


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] joaoleveiga commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-25 Thread via GitHub


joaoleveiga commented on PR #37817:
URL: https://github.com/apache/spark/pull/37817#issuecomment-1445218024

   > Also merged to 3.3
   
   Thank you so much! Here I was assuming I would pick up this thread on monday 
but you delivered it  
   
   Cheers


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] santosh-d3vpl3x commented on pull request #40122: [SPARK-42349][PYTHON] Support pandas cogroup with multiple df

2023-02-25 Thread via GitHub


santosh-d3vpl3x commented on PR #40122:
URL: https://github.com/apache/spark/pull/40122#issuecomment-1445208101

   > qq,
   > 
   > > this is a breaking change to this experimental API.
   > 
   > What's breaking? would be good to keep the PR desc template 
(https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE)
   
   @HyukjinKwon I have changed the description to include the breaking change 
and some more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] aimtsou commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-25 Thread via GitHub


aimtsou commented on PR #37817:
URL: https://github.com/apache/spark/pull/37817#issuecomment-1445189813

   Thank you @srowen, really appreciated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite

2023-02-25 Thread via GitHub


dongjoon-hyun commented on PR #40169:
URL: https://github.com/apache/spark/pull/40169#issuecomment-1445189613

   > Merging. FYI - we are planning a refactor of Catalyst soon (post 3.4) and 
then we will integrate this with Spark's exception and error framework.
   
   Thank you for sharing the future plan, @hvanhovell !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix 
DataFrameReader to use the default source
URL: https://github.com/apache/spark/pull/40166


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40166: [SPARK-42570][CONNECT][PYTHON] Fix DataFrameReader to use the default source

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40166:
URL: https://github.com/apache/spark/pull/40166#issuecomment-1445173286

   Merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40143: [SPARK-42538][CONNECT] Make 
`sql.functions#lit` function support more types
URL: https://github.com/apache/spark/pull/40143


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40143: [SPARK-42538][CONNECT] Make `sql.functions#lit` function support more types

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40143:
URL: https://github.com/apache/spark/pull/40143#issuecomment-1445172271

   Merging to master/3.4. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40164: [SPARK-42569][CONNECT] Throw unsupported 
exceptions for non-supported API
URL: https://github.com/apache/spark/pull/40164


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40164:
URL: https://github.com/apache/spark/pull/40164#issuecomment-1445171880

   I am merging this one. Can you do persist in a follow-up?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on a diff in pull request #40164: [SPARK-42569][CONNECT] Throw unsupported exceptions for non-supported API

2023-02-25 Thread via GitHub


hvanhovell commented on code in PR #40164:
URL: https://github.com/apache/spark/pull/40164#discussion_r1117957143


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -2461,6 +2461,60 @@ class Dataset[T] private[sql] (val session: 
SparkSession, private[sql] val plan:
 new DataFrameWriterV2[T](table, this)
   }
 
+  def unpersist(blocking: Boolean): this.type = {

Review Comment:
   Yeah, we can use the storage enum. We will move the enum to common/util 
later on.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40167: [SPARK-42561][CONNECT] Add temp view API 
to Dataset
URL: https://github.com/apache/spark/pull/40167


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40167: [SPARK-42561][CONNECT] Add temp view API to Dataset

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40167:
URL: https://github.com/apache/spark/pull/40167#issuecomment-1445171335

   Merging. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40169: [SPARK-42575][Connect][Scala] Make all 
client tests to extend from ConnectFunSuite
URL: https://github.com/apache/spark/pull/40169


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40169: [SPARK-42575][Connect][Scala] Make all client tests to extend from ConnectFunSuite

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40169:
URL: https://github.com/apache/spark/pull/40169#issuecomment-1445171047

   Merging. FYI - we are planning a refactor of Catalyst soon (post 3.4) and 
then we will integrate this with Spark's exception and error framework.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell closed pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-25 Thread via GitHub


hvanhovell closed pull request #40168: [SPARK-42573][Connect][Scala] Enable 
binary compatibility tests on all major client APIs
URL: https://github.com/apache/spark/pull/40168


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] hvanhovell commented on pull request #40168: [SPARK-42573][Connect][Scala] Enable binary compatibility tests on all major client APIs

2023-02-25 Thread via GitHub


hvanhovell commented on PR #40168:
URL: https://github.com/apache/spark/pull/40168#issuecomment-1445170286

   Merging, thanks for doing this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] bjornjorgensen commented on pull request #40171: [Tests] Refactor TPCH schema to separate file similar to TPCDS for code reuse

2023-02-25 Thread via GitHub


bjornjorgensen commented on PR #40171:
URL: https://github.com/apache/spark/pull/40171#issuecomment-1445168808

   [priv...@spark.apache.org](mailto:priv...@spark.apache.org)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #37817: [SPARK-40376][PYTHON] Avoid Numpy deprecation warning

2023-02-25 Thread via GitHub


srowen commented on PR #37817:
URL: https://github.com/apache/spark/pull/37817#issuecomment-1445157668

   Also merged to 3.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Kilo59 commented on pull request #29591: [SPARK-32714][PYTHON] Initial pyspark-stubs port.

2023-02-25 Thread via GitHub


Kilo59 commented on PR #29591:
URL: https://github.com/apache/spark/pull/29591#issuecomment-1445147815

   Has anyone solved the problem of trying to type-check pyspark code without 
installing the 200+MB pyspark package?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on pull request #40116: [WIP]SPARK-41391 Fix

2023-02-25 Thread via GitHub


srowen commented on PR #40116:
URL: https://github.com/apache/spark/pull/40116#issuecomment-1445132726

   Eh, this does not explain the issue at all. Please do so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-25 Thread via GitHub


vicennial commented on code in PR #40147:
URL: https://github.com/apache/spark/pull/40147#discussion_r1117931152


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -183,6 +183,87 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+// The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+// sequence of directories and the final file element.
+// Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+// The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e
+// The relative path of the file on the server's filesystem will be the 
same as the name of
+// the provided artifact)
+string name = 1;
+// Raw data.
+bytes data = 2;
+// CRC to allow server to verify integrity of the artifact.
+int64 crc = 3;

Review Comment:
   Yes, that makes sense but I'm wondering how we should proceed with the 
naming since the naming `Aritifact` should technically represent any artifact 
regardless of size.
   
   A proposal for the naming:
   Name the smallest unit of data as `ArtifactChunk`
   ```
 message ArtifactChunk {
   bytes data = 1;
   int64 crc = 2;
 }
   ```
   Use the name `SingleAritifact` (or perhaps `CompactArtifact`)  instead of 
`Artifact`
   ```
 message SingleArtifact {
   string name = 1;
   ArtifactChunk data = 2;
 }
   ```
   Rename `BeginChunkedArtifact` to `ChunkedArtifact`
   ```
 message ChunkedArtifact {
   string name = 1;
   int64 total_bytes = 2;
   int64 num_chunks = 3;
   ArtifactChunk initial_chunk = 4;
 }
   ```
   
   Now we can define an "Artifact" (verbally) as either a {`SingleArtifact`} or 
{`ChunkedAritfact` + it's appended `ArtifactChunks`}. WDYT? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] vicennial commented on a diff in pull request #40147: [SPARK-42543][CONNECT] Specify protocol for UDF artifact transfer in JVM/Scala client

2023-02-25 Thread via GitHub


vicennial commented on code in PR #40147:
URL: https://github.com/apache/spark/pull/40147#discussion_r1117927851


##
connector/connect/common/src/main/protobuf/spark/connect/base.proto:
##
@@ -183,6 +183,60 @@ message ExecutePlanResponse {
   }
 }
 
+// Request to transfer client-local artifacts.
+message AddArtifactsRequest {
+
+  // Definition of an Artifact.
+  message Artifact {
+// The name of the artifact is expected in the form of a "Relative Path" 
that is made up of a
+// sequence of directories and the final file element.
+// Examples of "Relative Path"s: "jars/test.jar", "classes/xyz.class", 
"abc.xyz", "a/b/X.jar".
+// The server is expected to maintain the hierarchy of files as defined by 
their name. (i.e
+// The relative path of the file on the server's filesystem will be the 
same as the name of
+// the provided artifact)
+string name = 1;
+// Raw data.
+bytes data = 2;
+  }
+
+  // A number of small artifacts batched into a single RPC.
+  message Batch {
+repeated Artifact artifacts = 1;
+  }
+
+  // The client_id is set by the client to be able to collate streaming 
responses from
+  // different queries.
+  string client_id = 1;
+
+  // User context
+  UserContext user_context = 2;
+
+  // The payload is either a batch of artifacts or a partial chunk of a large 
artifact.
+  oneof payload {
+Batch batch = 3;
+// A large artifact chunked into multiple requests. The server side should 
assume that the
+// artifact has been completely uploaded either when it encounters a new 
artifact name, or
+// when the the stream is completed.
+Artifact chunk = 4;

Review Comment:
   > Do we also need EndChunkedArtifact? 
   
   The `BeginChunkedArtifact` contains some metadata but I'm not sure what 
other data we can include in `EndChunkedArtifact` that's also not available in 
`BeginChunkedArtifact` so I've left it out in order to keep message types 
minimal. However, no strong preference for either way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] AlanBateman commented on pull request #39909: [SPARK-42369][CORE] Fix constructor for java.nio.DirectByteBuffer

2023-02-25 Thread via GitHub


AlanBateman commented on PR #39909:
URL: https://github.com/apache/spark/pull/39909#issuecomment-1445042515

   It's the same issue for JDK 8 where it was completely unsupported to be make 
use of the JDK private constructor. For the JDK side, all we can do it strongly 
courage the Spark project to move away from these hacks. If there are 
compelling cases for introducing new APIs or interfaces then please bring them 
to the OpenJDK mailing lists for discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR closed pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold

2023-02-25 Thread via GitHub


HeartSaVioR closed pull request #40163: [SPARK-42567][SS][SQL] Track load time 
for state store provider and log warning if it exceeds threshold
URL: https://github.com/apache/spark/pull/40163


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #40163: [SPARK-42567][SS][SQL] Track load time for state store provider and log warning if it exceeds threshold

2023-02-25 Thread via GitHub


HeartSaVioR commented on PR #40163:
URL: https://github.com/apache/spark/pull/40163#issuecomment-1445041909

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR closed pull request #40162: [SPARK-42566][SS] RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD

2023-02-25 Thread via GitHub


HeartSaVioR closed pull request #40162: [SPARK-42566][SS] RocksDB StateStore 
lock acquisition should happen after getting input iterator from inputRDD
URL: https://github.com/apache/spark/pull/40162


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on pull request #40162: [SPARK-42566][SS] RocksDB StateStore lock acquisition should happen after getting input iterator from inputRDD

2023-02-25 Thread via GitHub


HeartSaVioR commented on PR #40162:
URL: https://github.com/apache/spark/pull/40162#issuecomment-1445041550

   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Surbhi-Vijay commented on pull request #40171: [Tests] Refactor TPCH schema to separate file similar to TPCDS for code reuse

2023-02-25 Thread via GitHub


Surbhi-Vijay commented on PR #40171:
URL: https://github.com/apache/spark/pull/40171#issuecomment-1445041525

   Can someone please help in creating issue for this pull request ? I do not 
have a ASF Jira account.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Surbhi-Vijay opened a new pull request, #40171: [Tests] Refactor TPCH schema to separate file similar to TPCDS for code reuse

2023-02-25 Thread via GitHub


Surbhi-Vijay opened a new pull request, #40171:
URL: https://github.com/apache/spark/pull/40171

   ### What changes were proposed in this pull request?
   Changes are only in tests files.
   - Refactored the `TPCHBase` class and created `TPCHSchema` similar to TPCDS.
   - Created a base class `TPCSchema` which is extended by `TPCDSSchema` and 
`TPCHSchema`.
   
   ### Why are the changes needed?
   This PR just refactors the code. Going forward it will help in making code 
changes at just one place to reflect in both TPCDS and TPCH. 
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Ran the PlanStabilitySuite to verify the changes. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org