Grant Henke created KUDU-2466:
---------------------------------

             Summary: Fault tolerant scanners can over-allocate memory and crash a cluster
                 Key: KUDU-2466
                 URL: https://issues.apache.org/jira/browse/KUDU-2466
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.4.0
            Reporter: Grant Henke


When testing a Spark job with fault tolerant scanners enabled, reading a large
table (~1.5TB replicated) with many columns used up all of the memory on the
tablet servers. Roughly 400 GB of total memory was consumed even though the
memory limit was configured for 60 GB. This impacted all services on the
machines, making the cluster effectively unusable. Killing the job running the
scans did not free the memory; however, restarting the tablet servers resulted
in a healthy cluster.


Based on a chat with [~tlipcon], [~jdcryans], and [~mpercy], it looks like we
are not lazy in MergeIterator initialization. We could fix this by initializing
the merger's sub-iterators lazily based on rowset bounds, which would limit the
number of concurrently open scanners to O(rowset height); a rough sketch of the
idea is below.
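
For illustration only, here is a minimal, self-contained sketch of the lazy-merge
idea. This is not Kudu's MergeIterator: the Run/LazyMerge names and the toy integer
keys are made up for the example. The point is that sub-iterators are opened only
once the merge position reaches their lower bound, so the number of concurrently
open ones is bounded by how many rowsets overlap at any key, not by the total
rowset count.

{code:cpp}
// Toy sketch of lazy merge initialization (not Kudu code).
#include <algorithm>
#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

// Stand-in for a rowset: keys are known to lie within its bounds up front.
struct Run {
  std::vector<int> keys;    // pretend these stay on disk until Init()
  std::vector<int> buffer;  // "allocated" state, only filled by Init()
  size_t pos = 0;

  int min_key() const { return keys.front(); }
  void Init() { buffer = keys; }  // real code would allocate decoders/blocks here
  bool exhausted() const { return pos >= buffer.size(); }
  int peek() const { return buffer[pos]; }
  void advance() { ++pos; }
};

// Merge all runs in key order, opening each run lazily once the merge
// position reaches its lower bound. Concurrently open runs are bounded by
// the maximum overlap of their bounds ("rowset height"), not the run count.
std::vector<int> LazyMerge(std::vector<Run> runs) {
  std::sort(runs.begin(), runs.end(),
            [](const Run& a, const Run& b) { return a.min_key() < b.min_key(); });

  using Entry = std::pair<int, size_t>;  // (next key, run index)
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  std::vector<int> out;
  size_t next_to_open = 0;

  // Open every not-yet-opened run whose lower bound has been reached.
  auto open_eligible = [&](int up_to_key) {
    while (next_to_open < runs.size() &&
           runs[next_to_open].min_key() <= up_to_key) {
      runs[next_to_open].Init();
      heap.emplace(runs[next_to_open].peek(), next_to_open);
      ++next_to_open;
    }
  };

  while (!heap.empty() || next_to_open < runs.size()) {
    if (heap.empty()) {
      // No open run covers the current position: jump to the next lower bound.
      open_eligible(runs[next_to_open].min_key());
    } else {
      open_eligible(heap.top().first);
    }
    auto [key, idx] = heap.top();
    heap.pop();
    out.push_back(key);
    runs[idx].advance();
    if (!runs[idx].exhausted()) heap.emplace(runs[idx].peek(), idx);
  }
  return out;
}

int main() {
  // Four runs, but their bounds overlap at most two at a time, so at most
  // two runs are ever open concurrently.
  std::vector<Run> runs = {{{1, 4, 9}}, {{3, 5}}, {{100, 120}}, {{110, 130}}};
  for (int k : LazyMerge(runs)) std::cout << k << ' ';
  std::cout << '\n';  // prints: 1 3 4 5 9 100 110 120 130
}
{code}

In this toy example only two runs hold buffers at any point even though four are
merged, which is the memory bound the lazy approach is after.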




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
