[ https://issues.apache.org/jira/browse/HDFS-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350828#comment-14350828 ]
Colin Patrick McCabe commented on HDFS-7844: -------------------------------------------- bq. Open addressing (probing) is used for Hash table in the patch. Now the load factor is 0.5, not sure whether we can still get similar performance if we choose a little bigger value (say 0.7, certainly it should be less than 1). Bigger loadfactor will increase the collision, but will save lots of memory. I don't have the performance data, but I see other hash table implementations using open addressing choose 0.7 ~ 0.75 as loadfactor. According to wikipedia (see http://en.wikipedia.org/wiki/Hash_table ), "open addressing" is where you store the records in the hash table itself. This is different than probing, which is when you don't have a linked list per slot, but simply find another slot when collisions occur. Open addressing does require probing, but probing does not require open addressing. This hash table uses probing, but open addressing is optional. In the test code I wrote, open addressing is not used (entries are not stored in the hash table itself... only pointers to entries are stored). Partly this is because I can make the hash table larger that way. In the block manager, we should not use open addressing, because BlockInfo structures are going to have variable size due to the variable replication factors. I think you are correct, though, that we could use a higher load factor than 0.5. How well it will work will depend on a few things. One very important thing is the quality of the hash function. We need good dispersion to avoid clustering and non-constant behavior. bq. How about write it as a configuration and default value is 0.5? Users don't need to change the default value it if they have big memory, but if the memory is limit? Good idea bq. in ProbingHashSet#getInternal ... We should call return null \[when slot == originalSlot\]. (Actually we will never reach there since it's at most half full) Fixed bq. ProbingHashSet#maintainCompactness... Yeah. I looked at this again and it was broken. I think what we should do instead is just call {{putInternal}} on each element with {{overwrite = false}}. Then if we find a key equal to the current one, we know that the element is already in the right slot. bq. Currently even we use slf4j, but in someplace we still do something like Long.toHexString(addr) and it will affect little performance. Can we check the log level in those places? I hate to put all those if statements in, but I think you're probably right. I also fixed one or two cases where I wasn't calling {{Long.toHexString}} on an address. I wish slf4j supported formatting strings. bq. Unnessary import in ProbingHashSet, MemoryManager Fixed bq. 6. typos. Fixed > Create an off-heap hash table implementation > -------------------------------------------- > > Key: HDFS-7844 > URL: https://issues.apache.org/jira/browse/HDFS-7844 > Project: Hadoop HDFS > Issue Type: Sub-task > Affects Versions: HDFS-7836 > Reporter: Colin Patrick McCabe > Assignee: Colin Patrick McCabe > Attachments: HDFS-7844-scl.001.patch > > > Create an off-heap hash table implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)