[ 
https://issues.apache.org/jira/browse/HDFS-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350828#comment-14350828
 ] 

Colin Patrick McCabe commented on HDFS-7844:
--------------------------------------------

bq. Open addressing (probing) is used for Hash table in the patch. Now the load 
factor is 0.5, not sure whether we can still get similar performance if we 
choose a little bigger value (say 0.7, certainly it should be less than 1). 
Bigger loadfactor will increase the collision, but will save lots of memory. I 
don't have the performance data, but I see other hash table implementations 
using open addressing choose 0.7 ~ 0.75 as loadfactor. 

According to wikipedia (see http://en.wikipedia.org/wiki/Hash_table ), "open 
addressing" is where you store the records in the hash table itself.  This is 
different than probing, which is when you don't have a linked list per slot, 
but simply find another slot when collisions occur.  Open addressing does 
require probing, but probing does not require open addressing.

This hash table uses probing, but open addressing is optional.  In the test 
code I wrote, open addressing is not used (entries are not stored in the hash 
table itself... only pointers to entries are stored).  Partly this is because I 
can make the hash table larger that way.  In the block manager, we should not 
use open addressing, because BlockInfo structures are going to have variable 
size due to the variable replication factors.

I think you are correct, though, that we could use a higher load factor than 
0.5.  How well it will work will depend on a few things.  One very important 
thing is the quality of the hash function.  We need good dispersion to avoid 
clustering and non-constant behavior.

bq. How about write it as a configuration and default value is 0.5? Users don't 
need to change the default value it if they have big memory, but if the memory 
is limit?

Good idea

bq. in ProbingHashSet#getInternal ... We should call return null \[when slot == 
originalSlot\]. (Actually we will never reach there since it's at most half 
full)

Fixed

bq. ProbingHashSet#maintainCompactness...

Yeah.  I looked at this again and it was broken.  I think what we should do 
instead is just call {{putInternal}} on each element with {{overwrite = 
false}}.  Then if we find a key equal to the current one, we know that the 
element is already in the right slot.

bq. Currently even we use slf4j, but in someplace we still do something like 
Long.toHexString(addr) and it will affect little performance. Can we check the 
log level in those places?

I hate to put all those if statements in, but I think you're probably right.  I 
also fixed one or two cases where I wasn't calling {{Long.toHexString}} on an 
address.  I wish slf4j supported formatting strings.

bq. Unnessary import in ProbingHashSet, MemoryManager

Fixed

bq. 6. typos.

Fixed

> Create an off-heap hash table implementation
> --------------------------------------------
>
>                 Key: HDFS-7844
>                 URL: https://issues.apache.org/jira/browse/HDFS-7844
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: HDFS-7836
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7844-scl.001.patch
>
>
> Create an off-heap hash table implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to