[algogeeks] Re: hash

Don Fri, 16 Sep 2011 08:14:55 -0700

There are entire books on the subject, so I can't give you everything,
but here's a quick summary:

First you need a hash function which maps your input to an index in
the table. Hashing strings is a common application, and there are a
lot of string hash functions. Most common is something like this:

int hash(char *s)
{
  int result = 0;
  for(int i = 0; s[i]; ++i)
    result = 17 * result + s[i];
  return result;
}

You would then use mod to map the return value into the size of the
hash table.

There are better hash functions, but that one is not bad for most
purposes.

The next issue to deal with his how to handle collisions. There is
always the possibility that two values will map to the same spot in
the hash table, and the more full the table gets, the more that will
happen.

The most basic method is linear probing, in which you simply check the
next slot, and then the next, etc. until you find an empty slot. There
are several problems with this scheme. First, as the table gets full
it tends to create big clumps of filled cells, which reduces the
efficiency of searching and adding to the table. Secondly, when you
want to delete from the table, you might leave some other element in a
spot where it won't be found. That's because with this scheme, the
search table method must look in the location indicated by the hash
function, and if it is occupied, it must search each cell until it
finds one which matches or until it reaches an empty cell. If you
deleted one of the elements between the hash value of the element you
are searching for and the location where it was ultimately stored, you
will run into trouble if you are not careful. This is typically
addressed by marking a cell where an element has been deleted as
"deleted". Then the search knows to keep looking past that location. A
cleanup can be run occasionally to put things back closer to where
they should be, but the cleanup is costly in terms of time, so you
don't want to do it too often.

Using a rehash function is a better way to deal with collisions. The
rehash is a different hash function which puts the element in another
location. I like to use a hash function which has a second parameter
which defaults to zero, but other values can be used to rehash a
string. Here is one:

int hash(char *str, int rehash=0)
{
  int result = rehash;
  for(int i = 0; s[i]; ++i)
    result ^= s[result^str[i]];   // s[] is a permutation of values
0..255
  return result;
}

This has the nice characteristic that if hash(s1) == hash(s2),
hash(s1,1) and hash(s2,1) are likely to be different.

In some cases you can design a hash function which avoids collisions.
If you have 100 known strings to hash, you can usually use the second
function above and find a permutation s which results in no
collisions. Better yet, if you can control the hash tags, you can
ensure no collisions. In one case I was managing a set of up to N
items, and each one had a unique id tag. The tags didn't need to be
sequential, but each one had to be unique. So I assigned the tags by
building a queue of integers initialized with values 1..N. When I
needed an id for a new item I pulled it out of the queue, and when an
item was deleted, I added N to it's id and put that id into the queue.
This guaranteed that a hash function of id%N would never generate
collisions.

One of my favorite ways to deal with collisions is to have each cell
in the table be a bucket which can contain multiple elements. Making
it be a binary search tree is very nice, because it makes resolving
collisions into an O(ln n) operation rather than O(n) for linear
probing. If you already have a binary search tree class, you can just
make a table of them to use as a hash table.

In most cases you need to design in the ability to delete elements
from the hash table. If you don't need that ability, that makes your
job easier, but most applications do require it. You must make sure
that deleting one item doesn't affect searches on other items, and
that repeated insertions and deletes don't degrade the performance of
the hash table.

Chosing the right size for a hash table is very important. Hash table
performance degrades significantly when the table gets more than half
full. By the time it is 80% full you might as well be using a linear
search. Some hash systems use variable sized hash tables. But changing
size is very expensive because you must reinsert all the items into
the new table. You are better off, when possible, picking the right
size to start with.

There are many schools of thought on hashing, and many ways to tailor
a hashing system to a particular need. Think about the requirements of
your application and look for the best way to accomplish your specific
task.

I hope this helps.

Don

On Sep 16, 9:17 am, prasanth n <nprasnt...@gmail.com> wrote:
> can anyone tell how to implement a hash table in C?? how to choose the hash
> function?? algorithm??
>
> --
> *prasanth*

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to algogeeks@googlegroups.com.
To unsubscribe from this group, send email to 
algogeeks+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.

[algogeeks] Re: hash

Reply via email to