Re: Blooming Filters (was: Re: [Job] MySQL consultation)

Tzahi Fadida Thu, 22 Mar 2007 07:14:46 -0800

On Thursday 22 March 2007 16:18, Peter wrote:
> On Thu, 22 Mar 2007, Tzahi Fadida wrote:
> > Advocating is a strong word, i was suggesting. How exactly would you
> > address 128gb,256gb? Unless of course your system board and CPU
> > supports such sizes...
>
> The board does not care about sizes. Disk requests are serialized and
> they can be any lengths. Implementing a 1024 bit wide address counter to
> be pushed out serially to hardware is trivial even with an 8 bit cpu
> from 20 years ago. The problem is speed and size. Anything that fits in


And where prey tell, you can go to a store today and get a computer that 
support addressing of 1024 bits RAM?
Being realistic, you have a 32bit system in place and all you need is to 
implement the 2 or 4 gb blooming filter, why buy an insanely expansive new 
computer instead of just adding some PCI with some memory that would be good 
enough for your needs? Like an Asus battery backed ram drive in 50$ + as much 
ddrs you need. I think 1gb~=500nis *4 = 2000. ANS + 50$ = 2250nis.

New computer = 5000nis + 2250nis at least.

> 1 register can be manipulated in 1 clock cycle or less, that is fast.
> Thus jumping around in an index or a tree using 32 bit integers on 32
> bit hardware with ~3-4GB of RAM is not a problem. When more than 32 bits
> are needed things slow down a lot. It can go from 1 clock cycle to 4-5.
> When the 'local' data size is larger than the cache, things slow down to
> main memory speed. When that is not large enough, you start using disk
> seeks and VM swapping. Strictly speaking, a 32bit machine could handle

Swapping is not an option with hashing like that since the seeks can be all 
over the place thus, rendering the whole thing as if you were using only the 
disk and practically worse.

> 100TB or more of data, working as a Turing machine on the 100TB 'tape'
> (or tapes), but you really wouldn't want that (insert memories of
> recompiling Linux on i386 with 8MB ram here). One of the reasons RDBMSs
> 'like' to run on 'bare' partitions is exectly this: they prefer to use
> their own seek, hash, and striping algorythms instead of relying on the
> OS. So by the time any dimension of the problem touches on 2^32 things
> can slow down 10-1000 and worse times (even without script kiddies using
> SQL and PHP4 scripts to handle the output).

From what i understand from PostgreSQL dev team which some were also involved 
in oracle and knows how the raw partition usage works, the gains, i 
understand, over using a common linux FS, is no more than 10-20%.

> > As for 3GB, As i understand you must either have 2gb,4gb,... for this
> > blooming filters, i.e. you need 4gb which does not leave much room for
> > your kernel and apps in 32bit systems (and btw swapping is not really
> > an option with this hash func). As for "expensive", some memories
>
> There is no fixed size for Bloom filters, they are probabilistic. You
> can make a 10 bit Blooming filter, it depends on your hash algorythm.

Hence i said 2gb,4gb,...

> bogofilter) are also closely related to this afaik. The point is that
> there are ways to build very fast speculative indexes over huge data
> sets without actually storing the data. This can reduce the number of

I doubt that applies here since the hash function actually tries to spread the 
keys as much as possible to avoid false positives as much as possible.

> actual (expensive) lookups by orders of magnitude. I am not sure what
> Google uses for algorythms internally but from my adventures with web
> publishing and so on I would say that they are using similar principles.

They are not checking for existence they are locating actual data, thus, they 
would not use a hash function that can return false positives.
Btw, i was actually considering using blooming filters in my thesis for the 
full disjunctions algorithm, however, i decided that a simple hash table 
would be better. The reason was that if your application requires accuracy 
then, you cannot use a good enough answer. I.e. after getting a true value i 
would have to actually verify existence rendering it useless if i can just 
use a hash table (but not useless if the database does not support a proper 
hash table).

-- 
Regards,
        Tzahi.
--
Tzahi Fadida
Blog: http://tzahi.blogsite.org | Home Site: http://tzahi.webhop.info
WARNING TO SPAMMERS:  see at 
http://members.lycos.co.uk/my2nis/spamwarning.html

================================================================To unsubscribe, 
send mail to [EMAIL PROTECTED] with
the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]

Re: Blooming Filters (was: Re: [Job] MySQL consultation)

Reply via email to