[ https://issues.apache.org/jira/browse/HDFS-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023077#comment-17023077 ]
Stephen O'Donnell commented on HDFS-15140:
------------------------------------------

One problem I have with this change is that I have no idea why the FoldedTreeSet structure degrades so badly after some period of time. It would be great to figure out what that is and fix it.

The datanode originally used a ResizableGSet for the block map, which is unsorted. However, along with the introduction of FoldedTreeSet, we introduced sorted block reports, and the namenode now relies upon them being sorted. Therefore, there are (at least) two choices here:

1. Replace FoldedTreeSet with another sorted structure, e.g. TreeMap would work (a rough sketch follows at the end of this message).
2. Switch back to the GSet in the datanode, and then sort the blocks each time we create a block report (also sketched below).

In benchmarks I ran, TreeMap seems to perform marginally better than FoldedTreeSet on adds and deletes. For gets, FoldedTreeSet is marginally faster on sets under 8M entries, and TreeMap is faster on larger sets; the performance gap seems to widen in TreeMap's favour as the sets get bigger.

However, a major plus point for FoldedTreeSet is its reduced memory overhead. Loading each structure with 1M entries and capturing a heap dump, we can see TreeMap has an overhead of about 40 bytes per entry, as each object is stored in a TreeMap$Entry:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       1000000       40000000  java.util.TreeMap$Entry
   2:       1000129       24003096  java.lang.Long
   3:       1000000       24000000  com.sodonnell.BlockMock
{code}

FoldedTreeSet, however, has an overhead of only about 5 bytes per entry, because it stores up to 64 objects in each tree node:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       1000000       24000000  com.sodonnell.BlockMock
   2:         15941        4263344  [Ljava.lang.Object;
   3:         15625         875000  com.sodonnell.FoldedTreeSet$Node
{code}

Therefore switching to TreeMap would cost about (40 - 5) bytes * 1M blocks = 35MB (roughly 33.4MiB) of additional heap per 1M blocks, while giving comparable get, delete and add performance.

For option (2), we could capture the list of blocks for a block report, then drop the DN lock and sort them. Block reports are sent per volume, so sorting volume by volume would be doable. Benchmarking a sort on an array of objects, where the compare is performed against a long field, gives:

{code}
Time taken to sort 500000 elements: 338
Time taken to sort 1000000 elements: 337
Time taken to sort 2000000 elements: 632
Time taken to sort 4000000 elements: 1438
Time taken to sort 8000000 elements: 3244
Time taken to sort 16000000 elements: 7260
{code}

The extra sort would require materializing another list of references to the objects, but even for 20M blocks (a lot for a DN) at 4 bytes per reference this is only about 76MB, which would quickly be GC'ed afterwards.

> Replace FoldedTreeSet in Datanode with SortedSet or TreeMap
> -----------------------------------------------------------
>
>                 Key: HDFS-15140
>                 URL: https://issues.apache.org/jira/browse/HDFS-15140
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> Based on the problems discussed in HDFS-15131, I would like to explore replacing the FoldedTreeSet structure in the datanode with a builtin Java equivalent - either SortedSet or TreeMap.
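To make option (1) concrete, here is a minimal sketch, assuming a TreeMap keyed on block ID. The Replica class here is a hypothetical stand-in for the datanode's real replica objects (only the fields the sketch needs), and all locking is omitted:

{code}
import java.util.TreeMap;

// Hypothetical stand-in for the datanode's replica class.
class Replica {
    final long blockId;
    final long genStamp;
    final long numBytes;

    Replica(long blockId, long genStamp, long numBytes) {
        this.blockId = blockId;
        this.genStamp = genStamp;
        this.numBytes = numBytes;
    }
}

// Option (1): a block map backed by a TreeMap keyed on block ID.
// Iteration over values() is in ascending key order, so block
// reports come out pre-sorted with no extra work.
class TreeMapBlockMap {
    private final TreeMap<Long, Replica> map = new TreeMap<>();

    void add(Replica r) {
        map.put(r.blockId, r);
    }

    Replica get(long blockId) {
        return map.get(blockId);
    }

    Replica remove(long blockId) {
        return map.remove(blockId);
    }

    // Already sorted by block ID; suitable for building a block report.
    Iterable<Replica> sortedReplicas() {
        return map.values();
    }
}
{code}

The per-entry cost is exactly the TreeMap$Entry and boxed Long shown in the histogram above; the win is that sorted iteration is free.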
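And a minimal sketch of option (2), reusing the hypothetical Replica class above: copy the per-volume references while holding the lock, then release it and sort. The datasetLock object and the per-volume collection are stand-ins for the real datanode dataset lock and volume replica map:

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BlockReportSorter {
    // Snapshot one volume's replicas under the lock, then release it
    // and sort by block ID. Only the reference copy holds the lock.
    static List<Replica> sortedReportForVolume(Object datasetLock,
            Iterable<Replica> volumeReplicas) {
        List<Replica> snapshot = new ArrayList<>();
        synchronized (datasetLock) {
            for (Replica r : volumeReplicas) {
                snapshot.add(r);
            }
        }
        // The O(n log n) sort on a primitive long key runs off-lock.
        snapshot.sort(Comparator.comparingLong(r -> r.blockId));
        return snapshot;
    }
}
{code}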
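For reference, the sort timings quoted above can be reproduced with a harness along these lines (the units are presumably milliseconds; this naive version ignores JIT warm-up and needs a few GB of heap at the largest size, so treat the numbers as indicative only):

{code}
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

class SortBench {
    public static void main(String[] args) {
        Random rand = new Random(42);
        for (int n : new int[] {500_000, 1_000_000, 2_000_000,
                                4_000_000, 8_000_000, 16_000_000}) {
            Replica[] blocks = new Replica[n];
            for (int i = 0; i < n; i++) {
                blocks[i] = new Replica(rand.nextLong(), 1L, 0L);
            }
            long start = System.currentTimeMillis();
            // The compare reads a single long field, as in the benchmark.
            Arrays.sort(blocks, Comparator.comparingLong(r -> r.blockId));
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("Time taken to sort " + n
                + " elements: " + elapsed);
        }
    }
}
{code}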