[GitHub] [spark] hvanhovell commented on a diff in pull request #40610: [SPARK-42626][CONNECT] Add Destructive Iterator for SparkResult

via GitHub Thu, 30 Mar 2023 20:41:05 -0700


hvanhovell commented on code in PR #40610:
URL: https://github.com/apache/spark/pull/40610#discussion_r1153980266



##########
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkResult.scala:
##########
@@ -134,24 +134,41 @@ private[sql] class SparkResult[T](
   /**
    * Returns an iterator over the contents of the result.
    */
-  def iterator: java.util.Iterator[T] with AutoCloseable = {
+  def iterator: java.util.Iterator[T] with AutoCloseable =
+    buildIterator(destructive = false)
+
+  /**
+   * Returns an destructive iterator over the contents of the result.
+   */
+  def destructiveIterator: java.util.Iterator[T] with AutoCloseable =
+    buildIterator(destructive = true)
+
+  private def buildIterator(destructive: Boolean): java.util.Iterator[T] with AutoCloseable = {
     new java.util.Iterator[T] with AutoCloseable {
-      private[this] var batchIndex: Int = -1
       private[this] var iterator: java.util.Iterator[InternalRow] = Collections.emptyIterator()
       private[this] var deserializer: Deserializer[T] = _
+      private[this] var currentBatch: ColumnarBatch = _
+      private[this] val _destructive: Boolean = destructive
+
       override def hasNext: Boolean = {
         if (iterator.hasNext) {
           return true
         }
-        val nextBatchIndex = batchIndex + 1
+        val batchIndex = batches.indexOf(currentBatch)

Review Comment:
    I have been looking at this for a bit now, and I am not sure I like it. 
There are two issues:
   - In destructive mode you already know the location of the current batch: it 
should be at index 0. In non-destructive mode the index should be `batchIndex`. 
We are not doing anything with that information; instead we search for the 
batch with `indexOf` on every call.
   - The removal can be pretty expensive since we are removing from the head.
   
   I am wondering if we can use a better-suited data structure here. You could 
use a map, since that will give you cheap removals and fairly fast lookups. 
Alternatively, we could implement something akin to a linked list (I don't 
think you can use a stock linked list, since those don't like updates during 
iteration).
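
   To make the map option concrete, here is a rough, self-contained sketch. It 
is not the PR's code: `BatchStore`, `append`, and the generic batch type `B` 
are hypothetical stand-ins for how `SparkResult` holds its batches, and it 
assumes batches are keyed by arrival order.

```scala
import scala.collection.mutable

// Hypothetical stand-in for SparkResult's batch storage (an assumption, not the PR's code).
final class BatchStore[B] {
  // Batches keyed by arrival order; a hash map gives expected O(1) lookup and removal.
  private val batches = mutable.HashMap.empty[Int, B]
  private var nextKey = 0

  def append(batch: B): Unit = {
    batches.put(nextKey, batch)
    nextKey += 1
  }

  def size: Int = batches.size

  // Iterates batches in arrival order. In destructive mode each batch is
  // removed from the store as soon as it is handed to the caller.
  def iterator(destructive: Boolean): Iterator[B] = new Iterator[B] {
    private var index = 0

    override def hasNext: Boolean = {
      // Skip keys that an earlier destructive iteration already removed.
      while (index < nextKey && !batches.contains(index)) index += 1
      index < nextKey
    }

    override def next(): B = {
      if (!hasNext) throw new NoSuchElementException("no more batches")
      val batch = if (destructive) batches.remove(index).get else batches(index)
      index += 1
      batch
    }
  }
}
```

   With this shape the iterator tracks its own position, so there is no 
`indexOf` scan and no shifting of the remaining batches on removal. In this 
sketch the caller owns a batch once `next()` returns it in destructive mode; 
closing the batch after its rows are consumed is left out.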



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

