[ https://issues.apache.org/jira/browse/SPARK-25224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441131#comment-17441131 ]
ramakrishna chilaka commented on SPARK-25224:
---------------------------------------------

Can anyone please confirm whether there are any plans to revive this? Thanks.

> Improvement of Spark SQL ThriftServer memory management
> -------------------------------------------------------
>
>                 Key: SPARK-25224
>                 URL: https://issues.apache.org/jira/browse/SPARK-25224
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Dooyoung Hwang
>            Priority: Major
>              Labels: bulk-closed
>
> Spark SQL has only two options for managing Thrift Server memory: enabling
> spark.sql.thriftServer.incrementalCollect or leaving it disabled. (A short
> configuration sketch follows at the end of this description.)
>
> *1. With spark.sql.thriftServer.incrementalCollect enabled*
> *1) Pros:* The Thrift Server can handle large outputs without OOM.
> *2) Cons:*
> * Performance degrades because tasks are executed partition by partition.
> * Queries with a count limit are handled inefficiently because all
> partitions are executed. (By contrast, executeTake stops scanning once the
> limit has been collected.)
> * The result cannot be cached for FETCH_FIRST.
>
> *2. With spark.sql.thriftServer.incrementalCollect disabled*
> *1) Pros:* Good performance for small outputs.
> *2) Cons:*
> * Peak memory usage is very large because decompressed & deserialized rows
> are allocated in a "batch" manner, so OOM can occur for large outputs.
> * It is difficult to measure a query's peak memory usage, which makes
> configuring spark.driver.maxResultSize very difficult.
> * If the decompressed & deserialized rows fill up the eden area of the JVM
> heap, they are promoted to the old generation, increasing the likelihood of
> a stop-the-world full GC.
>
> The improvement idea is as follows (a sketch of the flow also follows at
> the end of this description):
> # *Dataset does not decompress & deserialize the result; it just returns
> the total row count and an iterator to the SQL executor.* That way, only
> the compressed, serialized result resides in memory, so memory usage is
> not only much lower than before but also controllable via
> spark.driver.maxResultSize.
> # *After the SQL executor gets the total row count and the iterator from
> the Dataset, it can use the row count to decide whether to collect the
> rows in a batch manner (appropriate for small row counts) or to
> deserialize and send them iteratively (appropriate for large row counts).*
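>
> A minimal configuration sketch of how the two current modes are selected,
> using a plain SparkSession for illustration (the Thrift Server reads the
> same conf); both keys are existing Spark configs, the values are examples:
>
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("thriftserver-memory-demo")
>   // true  => fetch results partition by partition (bounded memory, slower)
>   // false => collect all rows at once on the driver (fast, OOM risk)
>   .config("spark.sql.thriftServer.incrementalCollect", "true")
>   // Cap on the total size of collected results on the driver.
>   .config("spark.driver.maxResultSize", "2g")
>   .getOrCreate()
> {code}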
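>
> And a hypothetical sketch of the proposed flow. collectCountAndIterator,
> sendToClient and rowThreshold are illustrative names for this sketch, not
> existing Spark APIs:
>
> {code:scala}
> import org.apache.spark.sql.{DataFrame, Row}
>
> // Hypothetical: returns the total row count plus an iterator that
> // decompresses & deserializes rows lazily, partition by partition, so
> // only compressed, serialized blocks stay resident on the driver heap.
> def collectCountAndIterator(df: DataFrame): (Long, Iterator[Row]) = ???
>
> // Placeholder for the Thrift Server's row transport to the client.
> def sendToClient(rows: Iterator[Row]): Unit = rows.foreach(_ => ())
>
> def serveResult(df: DataFrame, rowThreshold: Long): Unit = {
>   val (totalRows, rows) = collectCountAndIterator(df)
>   if (totalRows <= rowThreshold) {
>     // Small result: materialize in one batch (fast path), now safe
>     // because the row count is known up front.
>     sendToClient(rows.toArray.iterator)
>   } else {
>     // Large result: stream rows without ever holding them all in
>     // memory, keeping peak usage bounded as with incrementalCollect.
>     sendToClient(rows)
>   }
> }
> {code}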