Good explanation, Brian! To anyone who skipped it because it looked long: read it anyway. cds
On 7/4/2012 5:26 AM, Brian Leach wrote:

Sorry to those who already know this, but maybe it's time to go over linear hashing in theory.

Linear hashing was a scheme devised by Litwin, originally for in-memory lists. In fact there are some good implementations in C# that provide better handling of Dictionary types. Applying it to a file system adds some complexity, but it's basically the same theory.

Let's start with a file that has 100 groups initially defined (that's 0 through 99). That is your minimum starting point and should ensure that the file never shrinks below it, so it doesn't begin its life with loads of splits right from the start as you populate it. You would size this similarly to the way you size a regular hashed file for your initial content: no point making work for yourself (or the database).

As data gets added, because the content is allocated unevenly, some of that load will be in primary and some in overflow: that's just the way of the world. No hashing is perfect. Unlike a static file, the overflow can't be added to the end of the file as a linked list (why nobody has done managed overflow is beyond me); it has to sit in a separate file.

At some point the amount of data held relative to the available space reaches a critical level and the file needs to reorganize. Rather than splitting the most heavily populated group - which would be the obvious thing - linear hashing works on the basis of a split pointer that moves incrementally through the file. So the first split breaks group 0 and adds group 100 to the end of the file, hopefully moving around half the content of group 0 to this new group. Of course, there is no guarantee that it will (depending on key structure), and also no guarantee that this will help anything if group 0 isn't overflowed or heavily populated anyway. So the next write may also cause a split, except now splitting group 1 into a new group 101, and so forth.
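To make the split mechanism concrete, here is a minimal single-threaded sketch in Python. It is an illustration of the theory Brian describes, not the actual UniVerse/UniData implementation; the class and parameter names (LinearHashFile, load_limit, etc.) are made up for the example, and it splits on a simple records-per-group load factor rather than a real disk-space calculation.

```python
class LinearHashFile:
    """Toy linear-hashed file: split pointer walks the groups in order,
    doubling the modulus once every group in the round has been split."""

    def __init__(self, minimum_modulus=100, load_limit=0.8):
        self.base = minimum_modulus      # modulus at the start of the current round
        self.split = 0                   # split pointer: next group to be split
        self.groups = [[] for _ in range(minimum_modulus)]
        self.load_limit = load_limit     # average records per group before a split
        self.count = 0

    def _address(self, key):
        # Hash with the round's base modulus; groups before the split
        # pointer have already been split, so rehash with double the modulus.
        h = hash(key)
        g = h % self.base
        if g < self.split:
            g = h % (self.base * 2)
        return g

    def insert(self, key, record):
        self.groups[self._address(key)].append((key, record))
        self.count += 1
        # When overall load crosses the threshold, split the group at the
        # split pointer - not necessarily the fullest group - and move on.
        if self.count / len(self.groups) > self.load_limit:
            self._split_next()

    def read(self, key):
        for k, record in self.groups[self._address(key)]:
            if k == key:
                return record
        return None

    def _split_next(self):
        old = self.split
        new = old + self.base            # e.g. group 0 splits into group 100
        self.groups.append([])
        moved, kept = [], []
        for key, record in self.groups[old]:
            # Rehashing with double the modulus sends each record to
            # either its old group or old + base - hopefully about half each.
            (moved if hash(key) % (self.base * 2) == new else kept).append((key, record))
        self.groups[old] = kept
        self.groups[new] = moved
        self.split += 1
        if self.split == self.base:      # end of the round: restart at group 0
            self.base *= 2
            self.split = 0
```

Note that nothing here guarantees the split helps: if the keys hash badly, group 0 may keep everything and the new group stays empty, exactly as described above.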
Eventually the pointer will reach the end, all of the initial 100 groups will have been split, and the whole process restarts with the split pointer moving back to zero. You now have 200 groups, and by this time everything should in theory have levelled out, but in the meantime there is still overloading and stuff will still be in overflow. The next split will create group 200 and move half of group 0 into it, and the whole process repeats forever. Oversized records (larger than the buffer size) also get moved out, because they stuff up the block allocation.

So why this crazy system, rather than hitting the filled groups as they get overstuffed? Because it makes finding a record easy. Because each round of splitting doubles the modulus, the maths is simple: if the group is at or after the split pointer, the record MUST be in that group (or its overflow). If it is before the split pointer, it could be in the original group or the split group, so you just rehash with double the modulus to see which one, without even having to scan the groups.

What makes the implementation difficult is that Litwin et al were assuming single-threaded access to an in-memory list. Maintaining the load factor and split pointer, and performing the splits themselves, under concurrent access all add a lot more complexity - unless you lock the whole file for the duration of an I/O operation and kill the performance.

And coming back to the manual: storing large numbers of data items - even large ones - in a type 19 file is a bad idea. Traversing directories is slow, especially on Windows, and locking is done against the whole directory.

Brian

_______________________________________________
U2-Users mailing list
U2-Users@listserver.u2ug.org
http://listserver.u2ug.org/mailman/listinfo/u2-users
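The lookup rule described above - one hash, at most one rehash, never a scan - fits in a few lines. This is a hedged sketch, not any real U2 internal; the function name and parameters are illustrative:

```python
def group_for(key_hash, base_modulus, split_pointer):
    """Return the one group a record can live in, given the round's
    base modulus and the current split pointer."""
    g = key_hash % base_modulus
    if g < split_pointer:
        # This group has already been split this round: rehash with
        # double the modulus to choose between g and g + base_modulus.
        g = key_hash % (base_modulus * 2)
    return g
```

For example, with 100 initial groups and the split pointer at 5, a key hashing to 7 must be in group 7, while a key hashing to 3 is resolved by rehashing mod 200 into either group 3 or group 103.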