Github user Dooyoung-Hwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22219#discussion_r212862913
  
    --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala ---
    @@ -289,6 +289,14 @@ private[hive] class SparkExecuteStatementOperation(
           sqlContext.sparkContext.cancelJobGroup(statementId)
         }
       }
    +
    +  private def getResultIterator(): Iterator[SparkRow] = {
    +    val (totalRowCount, iterResult) = result.collectCountAndIterator()
    +    val batchCollectLimit =
    +      sqlContext.getConf(SQLConf.THRIFTSERVER_BATCH_COLLECTION_LIMIT.key).toLong
    +    resultList = if (totalRowCount < batchCollectLimit) Some(iterResult.toArray) else None
    +    if (resultList.isDefined) resultList.get.iterator else iterResult
    --- End diff ---
    
    Yes, the case you mentioned (incremental collect disabled & FETCH_FIRST) does suffer a performance degradation. When the total row count exceeds batchCollectLimit, I thought caching the decompressed rows was not a good fit because of memory pressure. If FETCH_FIRST instead cached the compressed rows (not the decompressed ones) regardless of row count, then results larger than batchCollectLimit could be cached as well. However, an `Iterator` return type is probably not the right choice for that; a Scala `View` would be more suitable, because a fresh `Iterator` can be re-created from a `View` over the compressed rows. That change would touch much more of the source, though, so I didn't do it here.


---
