[ https://issues.apache.org/jira/browse/SPARK-25224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441131#comment-17441131 ]
ramakrishna chilaka commented on SPARK-25224:
---------------------------------------------

Can anyone please confirm whether there are any plans to revive this? Thanks.

> Improvement of Spark SQL ThriftServer memory management
> -------------------------------------------------------
>
>                 Key: SPARK-25224
>                 URL: https://issues.apache.org/jira/browse/SPARK-25224
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Dooyoung Hwang
>            Priority: Major
>              Labels: bulk-closed
>
> Spark SQL has only two options for managing Thrift Server memory: enabling
> spark.sql.thriftServer.incrementalCollect or leaving it disabled. (A short
> configuration sketch follows at the end of this description.)
>
> *1. With spark.sql.thriftServer.incrementalCollect enabled*
> *1) Pros:* The Thrift Server can handle large outputs without OOM.
> *2) Cons:*
> * Performance degrades because tasks are executed partition by partition.
> * Queries with a count limit are handled inefficiently because all
> partitions are executed. (By contrast, executeTake stops scanning once the
> limit has been collected.)
> * The result cannot be cached for FETCH_FIRST.
>
> *2. With spark.sql.thriftServer.incrementalCollect disabled*
> *1) Pros:* Good performance for small outputs.
> *2) Cons:*
> * Peak memory usage is very large because decompressed & deserialized rows
> are allocated in a "batch" manner, so OOM can occur for large outputs.
> * It is difficult to measure a query's peak memory usage, which makes
> configuring spark.driver.maxResultSize very difficult.
> * If the decompressed & deserialized rows fill up the eden area of the JVM
> heap, they are promoted to the old generation, increasing the likelihood of
> a stop-the-world full GC.
>
> The improvement idea is as follows (a sketch of the flow also follows at
> the end of this description):
> # *Dataset does not decompress & deserialize the result; it just returns
> the total row count and an iterator to the SQL executor.* That way, only
> the compressed, serialized result resides in memory, so memory usage is
> not only much lower than before but also controllable via
> spark.driver.maxResultSize.
> # *After the SQL executor gets the total row count and the iterator from
> the Dataset, it can use the row count to decide whether to collect the
> rows in a batch manner (appropriate for small row counts) or to
> deserialize and send them iteratively (appropriate for large row counts).*
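>
> A minimal configuration sketch of how the two current modes are selected,
> using a plain SparkSession for illustration (the Thrift Server reads the
> same conf); both keys are existing Spark configs, the values are examples:
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("thriftserver-memory-demo")
>   // true  => fetch results partition by partition (bounded memory, slower)
>   // false => collect all rows at once on the driver (fast, OOM risk)
>   .config("spark.sql.thriftServer.incrementalCollect", "true")
>   // Cap on the total size of collected results on the driver.
>   .config("spark.driver.maxResultSize", "2g")
>   .getOrCreate()
> {code}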
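>
> And a hypothetical sketch of the proposed flow. collectCountAndIterator,
> sendToClient and rowThreshold are illustrative names for this sketch, not
> existing Spark APIs:
>
> {code:scala}
> import org.apache.spark.sql.{DataFrame, Row}
>
> // Hypothetical: returns the total row count plus an iterator that
> // decompresses & deserializes rows lazily, partition by partition, so
> // only compressed, serialized blocks stay resident on the driver heap.
> def collectCountAndIterator(df: DataFrame): (Long, Iterator[Row]) = ???
>
> // Placeholder for the Thrift Server's row transport to the client.
> def sendToClient(rows: Iterator[Row]): Unit = rows.foreach(_ => ())
>
> def serveResult(df: DataFrame, rowThreshold: Long): Unit = {
>   val (totalRows, rows) = collectCountAndIterator(df)
>   if (totalRows <= rowThreshold) {
>     // Small result: materialize in one batch (fast path), now safe
>     // because the row count is known up front.
>     sendToClient(rows.toArray.iterator)
>   } else {
>     // Large result: stream rows without ever holding them all in
>     // memory, keeping peak usage bounded as with incrementalCollect.
>     sendToClient(rows)
>   }
> }
> {code}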