Github user Dooyoung-Hwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22219#discussion_r212862913

    --- Diff: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala ---
    @@ -289,6 +289,14 @@ private[hive] class SparkExecuteStatementOperation(
           sqlContext.sparkContext.cancelJobGroup(statementId)
         }
       }
    +
    +  private def getResultIterator(): Iterator[SparkRow] = {
    +    val (totalRowCount, iterResult) = result.collectCountAndIterator()
    +    val batchCollectLimit =
    +      sqlContext.getConf(SQLConf.THRIFTSERVER_BATCH_COLLECTION_LIMIT.key).toLong
    +    resultList = if (totalRowCount < batchCollectLimit) Some(iterResult.toArray) else None
    +    if (resultList.isDefined) resultList.get.iterator else iterResult

    --- End diff --

    Yes, the case you commented on (incremental collect disabled & FETCH_FIRST) does suffer a performance degradation. If the total row count is larger than batchCollectLimit, I considered it an unsuitable case for caching decompressed rows because of memory pressure. If FETCH_FIRST cached compressed rows (not decompressed rows) regardless of row count, results that exceed batchCollectLimit could be cached too. But an `Iterator` return type may not be a good choice for that; a Scala view would be the proper choice, because a fresh `Iterator` can be created again from a view over the compressed rows. That would require much larger source changes, though, so I didn't do it.
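The Iterator-vs-view distinction the comment leans on can be sketched in a few lines of plain Scala (illustrative only, not Spark code — the object and method names below are hypothetical): an `Iterator` is single-pass, so once the first fetch drains it there is nothing left to replay for FETCH_FIRST, whereas a lazy view can hand out a fresh `Iterator` on every call.

```scala
// Minimal sketch of why a single-pass Iterator cannot serve a
// FETCH_FIRST rewind, while a lazy view can be re-traversed.
object IteratorVsView {
  // Drains an Iterator once, then reports whether it is exhausted.
  def drainedAfterOnePass(rows: Seq[Int]): Boolean = {
    val it = rows.iterator
    it.foreach(_ => ())   // the first fetch consumes every element
    !it.hasNext           // nothing remains for a FETCH_FIRST replay
  }

  // A view stays lazy but produces a fresh Iterator on each call,
  // so cached (e.g. compressed) rows could be re-served from the start.
  def replaysIdentically(rows: Seq[Int]): Boolean = {
    val view = rows.view
    view.iterator.toList == view.iterator.toList
  }
}
```

This is the crux of the trade-off in the thread: keeping the `Iterator[SparkRow]` return type is a small change but forfeits replay, while switching to a view over compressed rows would support FETCH_FIRST at any row count at the cost of a much larger refactor.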
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org