Re: [GENERAL] Using hashtext and unique constraints together

2007-12-12 Thread Peter Childs
On 11/12/2007, Mason Hale [EMAIL PROTECTED] wrote:


 I'm thinking that an insert trigger that ensures (SELECT count(*) FROM
 page WHERE hashtext(url) = hashtext('http://www.postgresql.org')
 AND url = 'http://www.postgresql.org' ) = 0 won't work given MVCC, as two
 transactions could simultaneously insert the same url at the same time.



Why not? It should work, so long as the trigger also locks the table (a
share lock should be enough), though it could slow the table down if lots of
transactions write to it at once.
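A minimal sketch of this lock-then-check approach (the statements are
illustrative, not from the thread). The lock is taken explicitly before the
INSERT rather than inside a trigger, and a self-conflicting mode like SHARE
ROW EXCLUSIVE is used so that two concurrent inserters queue up rather than
deadlock on a lock upgrade, which a plain SHARE lock would risk:

```sql
BEGIN;
-- SHARE ROW EXCLUSIVE conflicts with itself and with ROW EXCLUSIVE, so
-- writers are serialized: a second inserter waits here until we commit.
LOCK TABLE page IN SHARE ROW EXCLUSIVE MODE;

-- Insert only if no row with the same hash AND the same url exists.
INSERT INTO page (url)
SELECT 'http://www.postgresql.org'
WHERE NOT EXISTS (
    SELECT 1 FROM page
    WHERE hashtext(url) = hashtext('http://www.postgresql.org')
      AND url = 'http://www.postgresql.org');
COMMIT;
```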

Regards

Peter.


Re: [GENERAL] Using hashtext and unique constraints together

2007-12-12 Thread Daniel Verite
Mason Hale wrote:

 The problem I need help with is guaranteeing uniqueness of the URL on
 insert, with a non-unique index on hashtext(url) and *without* creating a
 unique index on page(url).
 
 I'm thinking that an insert trigger that ensures (SELECT count(*) FROM page
 WHERE hashtext(url) = hashtext('http://www.postgresql.org') AND
 url = 'http://www.postgresql.org' ) = 0 won't work given MVCC, as two transactions
 could simultaneously insert the same url at the same time.

Suppose that before that SELECT, the transaction would insert the url
into a small dedicated table, with a unique index on the url column.
If a second transaction starts with the same url value before the first one
commits, it will be blocked at the point of the insert, because of that
unique index. And then it will fail with a constraint violation if and when
the first one commits. And even if it didn't, the SELECT count(*) above 
would return non-zero.
What makes this table small is that each row in it can be discarded as
soon as its corresponding transaction has been committed, since after that
point, the SELECT count(*) inside your trigger has enough visibility to take
care of the uniqueness check.
So there could be a cron job periodically deleting all rows from the small
table (that is, all committed rows). That would keep its size small, otherwise
of course you'd be eventually back to the original problem of having that
big index.

It's certainly not elegant, but I believe it would work.
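A sketch of the scheme, with invented names (the guard table and its column
are not from the thread, just illustrative):

```sql
-- Small dedicated table; its unique index serializes concurrent inserters.
CREATE TABLE url_guard (
    url text PRIMARY KEY
);

-- In each inserting transaction:
BEGIN;
-- A concurrent transaction inserting the same url blocks on this INSERT,
-- then fails with a unique-constraint violation once we commit.
INSERT INTO url_guard (url) VALUES ('http://www.postgresql.org');
-- At this point the trigger's SELECT count(*) check against page is
-- race-free, so the insert into page can proceed.
COMMIT;

-- Periodic cron job: committed rows are disposable, keeping the table
-- (and its unique index) small.
DELETE FROM url_guard;
```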

-- 
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


[GENERAL] Using hashtext and unique constraints together

2007-12-11 Thread Mason Hale
I recently discovered the hashtext() function, and I'm interested in using
it to reduce the size of one of the biggest indexes in my database.

I have a table of URLs with a unique index on the URL. Some of these URLs
are very long (we truncate them at 1024 characters), and we have many
millions of these in our database, and so the index is very, very big.

We've considered building an index using an MD5 hash, but the hashtext()
function seems better because it hashes to a 4-byte integer.

The problem with either MD5 or hashtext() is that neither can guarantee
unique output even if the inputs are all unique.

For SELECTs this is no problem, as I could build an index on hashtext(url),
and to write a query to find 'http://www.postgresql.org' would look
something like:

SELECT * FROM page WHERE hashtext(url) = hashtext('
http://www.postgresql.org') AND url = 'http://www.postgresql.org'

The hashtext(url) condition will hit the index, and in the rare event it
returns more than one row, the url = conditions will pick out the matching
one from those.
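The supporting index for that query would be a non-unique expression index
on the hash (the index name here is illustrative):

```sql
-- 4-byte integer keys instead of up-to-1024-character URLs.
CREATE INDEX page_url_hash ON page (hashtext(url));
```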

The problem I need help with is guaranteeing uniqueness of the URL on
insert, with a non-unique index on hashtext(url) and *without* creating a
unique index on page(url).

I'm thinking that an insert trigger that ensures (SELECT count(*) FROM page
WHERE hashtext(url) = hashtext('http://www.postgresql.org') AND
url = 'http://www.postgresql.org' ) = 0 won't work given MVCC, as two transactions
could simultaneously insert the same url at the same time.

Can anyone think of a way to guarantee uniqueness of the URL without adding
a unique constraint on the page(url) column?


Re: [GENERAL] Using hashtext and unique constraints together

2007-12-11 Thread Mason Hale
I recently discovered the hashtext() function, and I'm interested in using
 it to reduce the size of one of the biggest indexes in my database.

...

The problem with either MD5 or hashtext() is that neither can guarantee
 unique output even if the inputs are all unique.

...


 The problem I need help with is guaranteeing uniqueness of the URL on
 insert, with a non-unique index on hashtext(url) and *without* creating a
 unique index on page(url).

 I'm thinking that an insert trigger that ensures (SELECT count(*) FROM
 page WHERE hashtext(url) =
 hashtext('http://www.postgresql.org')
 AND url = 'http://www.postgresql.org' ) = 0 won't work given MVCC, as two
 transactions could simultaneously insert the same url at the same time.

 Can anyone think of a way to guarantee uniqueness of the URL without
 adding a unique constraint on the page(url) column?


I gather from the lack of response that the answer is no?

Sorry to respond to my own post... but I had an idea -- what if we hash the
same URL two different ways, and create a unique compound index on both
hashes?

For example

CREATE UNIQUE INDEX page_hash_uniq ON page ( hashtext(url), hashtext(md5(url)) )

The idea here is that the odds of two different strings hashing to the same
value with both hashtext *and* md5 are remote enough to rely on for
uniqueness. I'm running the md5 through hashtext to save space.

or even

CREATE UNIQUE INDEX page_salt_uniq ON page ( hashtext(url), hashtext(url || 'SALT') )

An alternative approach assumes that two different strings that hash to the
same value will not also collide when concatenated with some additional
(constant) data. Again, this doesn't absolutely guarantee uniqueness, but it
seems safe enough. I guess it depends on how truly random the hashtext
output is.

Both of the above would require 8 bytes for each index entry (plus
overhead), which is still a big savings over our current implementation.
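For what it's worth, the birthday-paradox arithmetic behind "remote enough"
can be checked directly. Assuming the two hashes behave like independent
uniform 32-bit functions (an assumption -- hashtext(url) and
hashtext(url || 'SALT') share the same underlying function, so true
independence isn't guaranteed), the probability of any collision among n
rows in a space of d values is roughly 1 - exp(-n(n-1)/2d):

```sql
-- Birthday bound for 10 million rows: a single 32-bit hash makes a
-- collision essentially certain, while 64 combined bits make it remote.
SELECT 1 - exp(-(1e7 * (1e7 - 1)) / (2 * 2^32)) AS p_collision_32bit,  -- ~1.0
       1 - exp(-(1e7 * (1e7 - 1)) / (2 * 2^64)) AS p_collision_64bit;  -- ~2.7e-6
```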

Any thoughts on these ideas?

thanks in advance,
Mason