Good explanation, Brian!
To anyone who skipped it because it looked long:  read it anyway.
cds

On 7/4/2012 5:26 AM, Brian Leach wrote:

Sorry to those who already know this, but maybe it's time to go over linear 
hashing in theory ..

Linear hashing was a scheme devised by Litwin, originally only for in-memory 
lists. In fact there are some good implementations in C# that provide better 
handling of Dictionary types. Applying it to a file system adds some complexity, 
but it's basically the same theory.

Let's start with a file that has 100 groups initially defined (that's 0 through 
99). That is your minimum starting point, and the file should never shrink below 
it, so it doesn't begin its life with loads of splits right from the start as 
you populate the file. You would size this similarly to the way you size a 
regular hashed file for your initial content: no point making work for yourself 
(or the database).

As data gets added, because the content is allocated unevenly, some of that 
load will be in primary and some in overflow: that's just the way of the world. 
No hashing is perfect. Unlike a static file, the overflow can't be added to the 
end of the file as a linked list (* why nobody has done managed overflow is 
beyond me); it has to sit in a separate file.

At some point the amount of data held in respect of the available space reaches 
a critical level and the file needs to reorganize. Rather than split the most 
heavily populated group - which would be the obvious thing - linear hashing 
works on the basis of a split pointer that moves incrementally through the 
file. So the first split breaks group 0 and adds group 100 to the end of the 
file, hopefully moving around half the content of group 0 to this new group. Of 
course, there is no guarantee that it will (depending on key structure), and 
also no guarantee that this will help anything if group 0 isn't overflowed or 
heavily populated in the first place. So the next write may also cause a split, 
except now splitting group 1 into a new group 101, and so forth.
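
To make the split mechanics concrete, here is a minimal in-memory sketch in 
Python. The names are mine, not how any MV engine lays a dynamic file out on 
disk, and Python's hash() stands in for the real hashing algorithm:

def split_next_group(groups, modulus, split_pointer):
    """Split the group at the split pointer into itself and a new group
    appended to the end of the file, then advance the pointer."""
    old = groups[split_pointer]
    groups.append([])                          # new group sits at index split_pointer + modulus
    keep, moved = [], []
    for key, value in old:
        if hash(key) % (modulus * 2) == split_pointer:
            keep.append((key, value))          # roughly half should stay ...
        else:
            moved.append((key, value))         # ... and half move, if the keys hash well
    groups[split_pointer] = keep
    groups[split_pointer + modulus] = moved
    return split_pointer + 1                   # next split hits group 1, then 2, and so on

# e.g. starting from 100 groups, the first split creates group 100 from group 0:
groups = [[] for _ in range(100)]
for n in range(500):
    key = "CUST*%d" % n
    groups[hash(key) % 100].append((key, "record data"))
pointer = split_next_group(groups, 100, 0)
print(len(groups), pointer)                    # -> 101 groups, split pointer now at 1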

Eventually the pointer will reach the end and all the initial 100 groups will 
have been split, and the whole process restarts with the split pointer moving 
back to zero. You now have 200 groups and by this time everything should in 
theory have levelled out, but in the meantime there is still overloading and 
stuff will still be in overflow. The next split will create group 200 and move 
around half of group 0 into it, and the whole process repeats for ever.
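
The round bookkeeping amounts to this (again a sketch, with hypothetical helper 
names):

def advance_round_if_complete(modulus, split_pointer):
    """Once the pointer has walked past every group that existed at the start
    of the round, the base modulus doubles and the pointer resets to zero."""
    if split_pointer >= modulus:
        return modulus * 2, 0      # 100 groups became 200, the next round makes 400, ...
    return modulus, split_pointer

print(advance_round_if_complete(100, 100))   # -> (200, 0): next split carves group 200 out of group 0
print(advance_round_if_complete(200, 37))    # -> (200, 37): mid-round, nothing changes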

Oversized records (> buffer size) also get moved out because they stuff up the 
block allocation.

So why this crazy system, rather than hitting the filled groups as they get 
overstuffed? Because it makes finding a record easy. Since linear hashing works 
by doubling the modulus (a power of 2 progression), the maths is simple - if the 
group is after the split point, the record MUST be in that group (or its 
overflow). If it is before the split point, it could be in the original group or 
the split group: so you can just rehash with double the modulus to check which 
one, without even having to scan the groups.
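
That lookup comes out as just a few lines (same caveats as the earlier sketches: 
assumed names, Python's hash() as a stand-in for the real hash):

def group_for_key(key, modulus, split_pointer):
    """Find the group a record must live in, without scanning any groups."""
    g = hash(key) % modulus
    if g >= split_pointer:
        return g                        # not yet split this round: the record MUST be here (or its overflow)
    return hash(key) % (modulus * 2)    # already split: the doubled modulus picks g or g + modulus

# e.g. with a base modulus of 100 and the split pointer at 5, a key hashing to
# group 3 gets re-checked against 200 and lands in group 3 or group 103.
print(group_for_key("CUST*1001", 100, 5))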

What makes the implementation difficult is that Litwin et al were assuming 
single-threaded access to an in-memory list. Concurrent access whilst 
maintaining the load factor, the split pointer and the splitting itself adds a 
lot more complexity, unless you lock the whole file for the duration of an IO 
operation and kill the performance.

And coming back to the manual, storing large numbers of data items - even large 
ones - in a type 19 file is a bad idea. Traversing directories is slow, 
especially in Windows, and locking is done against the whole directory.

Brian
