[ https://issues.apache.org/jira/browse/HBASE-16193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yu Li resolved HBASE-16193. --------------------------- Resolution: Fixed Closing this umbrella since all sub issues resolved and fix pushed into all 0.98+ branches. > Memory leak when putting plenty of duplicated cells > --------------------------------------------------- > > Key: HBASE-16193 > URL: https://issues.apache.org/jira/browse/HBASE-16193 > Project: HBase > Issue Type: Bug > Reporter: Yu Li > Assignee: Yu Li > Attachments: MemoryLeakInMemStore.png, MemoryLeakInMemStore_2.png > > > Recently we suffered from a weird problem that RS heap size could not reduce > much even after FullGC, and it kept FullGC and could hardly serve any > request. After debugging for days, we found the root cause: we won't count in > the allocated memory in MSLAB chunk when adding duplicated cells (including > put and delete). We have below codes in {{AbstractMemStore#add}} (or > {{DefaultMemStore#add}} for branch-1): > {code} > public long add(Cell cell) { > Cell toAdd = maybeCloneWithAllocator(cell); > return internalAdd(toAdd); > } > {code} > where we will allocate memory in MSLAB (if using) chunk for the cell first, > and then call {{internalAdd}}, where we could see below codes in > {{Segment#internalAdd}} (or {{DefaultMemStore#internalAdd}} for branch-1): > {code} > protected long internalAdd(Cell cell) { > boolean succ = getCellSet().add(cell); > long s = AbstractMemStore.heapSizeChange(cell, succ); > updateMetaInfo(cell, s); > return s; > } > {code} > So if we are writing a duplicated cell, we assume there's no heap size > change, while actually the chunk size is taken (referenced). > Let's assume this scenario, that there're huge amount of writing on the same > cell (same key, different values), which is not that special in > MachineLearning use case, and there're also few normal writes, and after some > long time, it's possible that we have many chunks with kvs like: {{cellA, > cellB, cellA, cellA, .... cellA}}, that we only counts 2 cells for each > chunk, but actually the chunk is full. So the devil comes, that we think it's > still not hitting flush size, while there's already GBs heapsize taken. > There's also a more extreme case, that we only writes a single cell over and > over again and fills one chunk quickly. Ideally the chunk should be cleared > by GC, but unfortunately we have kept a redundant reference in > {{HeapMemStore#chunkQueue}}, which is useless when we're not using chunkPool > by default. > This is the umbrella to describe the problem, and I'll open two sub-JIRAs to > resolve the above two issues separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)