Re: [GENERAL] Trigger Performance

2011-01-17 Thread Jasen Betts
On 2011-01-16, Randall Smith rand...@tnr.cc wrote:
 Before reading.  This is solved.  Was an error on my part.

 On Sun, 2011-01-16 at 03:46 +, Jasen Betts wrote:
 In plpgsql IF is an implicit select. 
 http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html
 
  IF EXISTS (SELECT 1 FROM t1 WHERE
  volume_id = NEW.volume_id AND name = NEW.name) THEN
  RAISE EXCEPTION '% already exists on volume', NEW.name;
  END IF;

 Thanks.  Certainly more concise.

 
  0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
 
 huh?
 
  According to that stat, this lookup should be adding about 0.5 seconds
  to 10,000 records,
 
 why?  what are you doing to 10,000 records?

 Inserting them.  Sorry, that was a critical omission on my part.  The
 trigger check is for inserts and I'm testing its performance by
 inserting 10,000 records.

that'll have n.log(n) complexity (or worse); you can't simply multiply by
10,000 because each row added to the index slows the average index lookup
time a little.
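
(Roughly: the i-th insert pays an O(log i) index probe, and the sum of
log i for i = 1..n grows on the order of n.log(n) overall.)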

 Turns out my EXPLAIN ANALYZE times were so low because I was querying
 for a volume_id that had only a few rows.  When I query for a volume_id
 with 10,000 rows, it changes to 7 ms, which matches the performance I'm
 seeing.  That's acceptable to me because that's probably at the upper
 end of what I'll see.  7 ms to check 10,000 text fields is actually
 impressive to me.

 how much disk (in bytes, and dollars) are you hoping to save by not
 using the index?

 I need to be able to fit my indexes in RAM.  

why? Indices that don't fit are still useful.  Doing 20 or so record
retrievals to confirm the absence of a record is likely to be slower
than doing 5 reads and a write or two to check and update a disk-based
index.

postgres isn't going to blindly load all the indices into RAM.

 I need to be able to fit my indexes in RAM.  This table will have a few
 billion records ...  These text fields can be up to 1k each.

this query will create roughly 300,000 sample records with 1K of text each;
repeat it with different numbers in the first generate_series
 10001,20000
 20001,30000
 etc.
until you get 2 billion records,
then create your indices and do your testing.

insert into t1 select a::integer as volume_id,
  md5((a*b)::text) || md5((a*b+1)::text) || md5((a*b+3)::text)
  || md5((a*b+4)::text) || md5((a*b+5)::text) || md5((a*b+6)::text)
  || md5((a*b+7)::text) || md5((a*b+8)::text) || md5((a*b+9)::text)
  || md5((a*b+11)::text) || md5((a*b+12)::text) || md5((a*b+13)::text)
  || md5((a*b+14)::text) || md5((a*b+15)::text) as name
from (select generate_series(1,10000) as a,
      generate_series(1,29) as b) as foo;
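
The index and probe query from the original post can then be recreated
against the sample data; a minimal sketch (the analyze step is my
addition, not from the original mails):

create index volume_id_idx on t1 (volume_id);
analyze t1;
explain analyze select exists (
  select 1 from t1 where volume_id = 300 and name = 'whodat');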
 

-- 
⚂⚃ 100% natural



Re: [GENERAL] Trigger Performance

2011-01-17 Thread Jasen Betts
On 2011-01-17, Fredric Fredricson fredric.fredric...@bonetmail.com wrote:

 On 01/15/2011 11:52 PM, Randall Smith wrote:
 Hi,

 I've created a trigger that checks the uniqueness of two columns in a
 table.  Traditionally, one would use a unique constraint, but in my
 case, the size of the unique index would be too large and some
 performance loss is acceptable.  However, the trigger performance seems
 to be far below what's indicated by an explain analyze of the query used
 in the trigger.

 The unique fields consist of an indexed int8 (volume_id) and a text
 field (name).  The average ratio of volume_id to name is 1 to 10,000.
 The query I'm using to check uniqueness in the trigger is:

 A possible work around could be to store an extra field that contains 
 the md5 hash of the text field and use that field in the index instead 
 of the text itself. The md5 sum could be calculated by your trigger.

you don't have to store it, only index on it.

create unique index t1_nodups on t1 (volume_id, md5(name));
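
With that index in place a duplicate insert simply errors out; for
example (sample values taken from earlier in the thread; the exact error
text may vary by version):

insert into t1 (volume_id, name) values (300, 'whodat');
insert into t1 (volume_id, name) values (300, 'whodat');
-- ERROR:  duplicate key value violates unique constraint "t1_nodups"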

-- 
⚂⚃ 100% natural



Re: [GENERAL] Trigger Performance

2011-01-16 Thread Alban Hertroys
On 16 Jan 2011, at 5:38, Randall Smith wrote:

 But you already do have an index on that id-field, so what's the problem 
 with using a unique constraint? Its unique index could just replace the 
 existing one.
 
 I'm OK with indexing the 8 byte integer, but I don't want to index the
 text field that could possibly be up to 1k.  In some simple tests,
 indexing both fields, the index was larger than the table.


Ah, it was about a different field - that makes much more sense :)

You probably don't need the full width of that field to be fairly certain that 
it's going to be unique from that point on anyway. The chance that a unique 
constraint kicks in when it shouldn't gets lower the wider your index on that 
field gets, but it's probably sufficiently low with relatively few characters, 
while it's probably not that big a deal if it kicks in early in a few cases.

For example, if you only index the first 100 bytes of each of these fields, you 
are very unlikely to run into a situation where your constraint claims the 
field is not unique while it actually is.

That shrinks your index down by a factor of 10, and IMO 100 bytes is still on 
the wide side for a unique text column. I don't think many people successfully 
enter 100 characters in a row without ever making an error.
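
In Postgres terms that could be an expression index on a prefix; a rough
sketch, assuming the table and column names from the original post (the
index name is hypothetical):

create unique index t1_vol_name_prefix
  on t1 (volume_id, substr(name, 1, 100));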

If instead the field contains computer-generated data, then there surely is an 
easier way to detect that this text is not going to be unique. You could store 
the values of the parameters used to generate that data for example, and put a 
unique constraint on those.
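
A hypothetical sketch of that, with made-up parameter columns:

alter table t1 add column param1 int, add column param2 int;
alter table t1 add constraint t1_params_unique
  unique (volume_id, param1, param2);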


I'd also wonder what the value is of two distinct records with the same 1000 
characters of text up to the last one. Are those distinct on purpose or is it 
more likely that someone made an error entering that field and accidentally 
managed to create a unique entry while in fact the unique constraint was 
supposed to kick in?

Besides that, you say there will be billions of records, but that text field 
needs to be unique? I really do hope the contents of that field will be 
computer-generated, or your users will get quite desperate coming up with new 
values to use after a while, especially if the contents need to make any sense.

But of course much depends on the nature of the data in your field. You haven't 
said anything about that.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.




Re: [GENERAL] Trigger Performance

2011-01-16 Thread Tom Lane
Randall Smith rand...@tnr.cc writes:
 I've created a trigger that checks the uniqueness of two columns in a
 table.  Traditionally, one would use a unique constraint, but in my
 case, the size of the unique index would be too large and some
 performance loss is acceptable.  However, the trigger performance seems
 to be far below what's indicated by an explain analyze of the query used
 in the trigger.

You realize of course that this is fundamentally broken and cannot be
trusted?  Worrying about the performance seems rather pointless.

The reason it can't be trusted is that neither of two concurrent
insertions will see the other one.
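
A sketch of the failure mode (the timeline is illustrative, not from the
original mail):

-- session A: begin; insert into t1 values (300, 'whodat');
--            (the trigger's SELECT finds no duplicate)
-- session B: begin; insert into t1 values (300, 'whodat');
--            (A hasn't committed, so B's SELECT finds no duplicate either)
-- session A: commit;
-- session B: commit;
-- result: two rows with the same (volume_id, name), despite the trigger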

regards, tom lane



Re: [GENERAL] Trigger Performance

2011-01-16 Thread Fredric Fredricson

On 01/15/2011 11:52 PM, Randall Smith wrote:

Hi,

I've created a trigger that checks the uniqueness of two columns in a
table.  Traditionally, one would use a unique constraint, but in my
case, the size of the unique index would be too large and some
performance loss is acceptable.  However, the trigger performance seems
to be far below what's indicated by an explain analyze of the query used
in the trigger.

The unique fields consist of an indexed int8 (volume_id) and a text
field (name).  The average ratio of volume_id to name is 1 to 10,000.
The query I'm using to check uniqueness in the trigger is:
A possible work around could be to store an extra field that contains 
the md5 hash of the text field and use that field in the index instead 
of the text itself. The md5 sum could be calculated by your trigger.
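
A rough sketch of that idea (the extra column and index names are
hypothetical):

alter table t1 add column name_md5 text;
create unique index t1_vol_name_md5 on t1 (volume_id, name_md5);
-- and in the BEFORE INSERT trigger, before returning NEW:
--   NEW.name_md5 := md5(NEW.name);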


/Fredric


 ...
 IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
 volume_id = NEW.volume_id AND name = NEW.name)) THEN
 RAISE EXCEPTION '% already exists on volume', NEW.name;
 END IF;
 ...

This trigger is called only BEFORE INSERTS.

In testing, without the trigger, inserts are at 10,000 every 2 seconds.
With the trigger, they are 10,000 every 40 seconds.  The output of
explain analyze suggests that this shouldn't be taking so long.

 EXPLAIN ANALYZE SELECT EXISTS (
 SELECT 1 FROM t1 WHERE volume_id = 300 AND name = 'whodat');

  Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
rows=1 loops=1)
InitPlan 1 (returns $0)
  ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
width=0) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (volume_id = 300)
Filter: (name = 'whodat'::text)
  Total runtime: 0.053 ms
(6 rows)

0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

According to that stat, this lookup should be adding about 0.5 seconds
to 10,000 records, far from the actual 38 seconds it is adding.  I've
tried to change up the query in the trigger to see if I could get
different results with not much luck.  Any idea what might be taking up
the extra time or what I can do to troubleshoot?

Thanks.

-Randall


Re: [GENERAL] Trigger Performance

2011-01-15 Thread Alban Hertroys
On 15 Jan 2011, at 23:52, Randall Smith wrote:

 Hi,
 
 I've created a trigger that checks the uniqueness of two columns in a
 table.  Traditionally, one would use a unique constraint, but in my
 case, the size of the unique index would be too large and some
 performance loss is acceptable.

But you already do have an index on that id-field, so what's the problem with 
using a unique constraint? Its unique index could just replace the existing one.
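
A sketch of what that could look like (the constraint name is mine):

alter table t1 add constraint t1_vol_name_key unique (volume_id, name);
-- the backing index on (volume_id, name) can still serve lookups on
-- volume_id alone, so the old single-column index could then be dropped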

Alban Hertroys

--
Screwing up is an excellent way to attach something to the ceiling.




Re: [GENERAL] Trigger Performance

2011-01-15 Thread Jasen Betts
On 2011-01-15, Randall Smith rand...@tnr.cc wrote:
 Hi,

 I've created a trigger that checks the uniqueness of two columns in a
 table.  Traditionally, one would use a unique constraint, but in my
 case, the size of the unique index would be too large and some
 performance loss is acceptable.  However, the trigger performance seems
 to be far below what's indicated by an explain analyze of the query used
 in the trigger.

 The unique fields consist of an indexed int8 (volume_id) and a text
 field (name).  The average ratio of volume_id to name is 1 to 10,000.
 The query I'm using to check uniqueness in the trigger is:

 ...
 IF (SELECT EXISTS (SELECT 1 FROM t1 WHERE
 volume_id = NEW.volume_id AND name = NEW.name)) THEN
 RAISE EXCEPTION '% already exists on volume', NEW.name;
 END IF;
 ...

In plpgsql IF is an implicit select. 
http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html

 IF EXISTS (SELECT 1 FROM t1 WHERE
 volume_id = NEW.volume_id AND name = NEW.name) THEN
 RAISE EXCEPTION '% already exists on volume', NEW.name;
 END IF;

i don't know if that will help performance though
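
For context, here is a full wrapper around that check; a sketch only,
since the original post elides the function definition (function and
trigger names are mine):

create or replace function t1_check_unique() returns trigger as $$
begin
  if exists (select 1 from t1
             where volume_id = NEW.volume_id and name = NEW.name) then
    raise exception '% already exists on volume', NEW.name;
  end if;
  return NEW;
end;
$$ language plpgsql;

create trigger t1_unique_check before insert on t1
  for each row execute procedure t1_check_unique();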

  Result  (cost=8.35..8.36 rows=1 width=0) (actual time=0.019..0.020
 rows=1 loops=1)
InitPlan 1 (returns $0)
  ->  Index Scan using volume_id_idx on t1  (cost=0.00..8.35 rows=1
 width=0) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (volume_id = 300)
Filter: (name = 'whodat'::text)
  Total runtime: 0.053 ms
 (6 rows)

I got 0.4ms the first time I tried this.

that actual runtime is helped by cache locality; reconnect and try
it again and you'll see a worse figure
(I got a factor of three difference)

at 53us it probably didn't hit the hard disk; when that starts
happening things will get much worse.

 0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds

huh?

 According to that stat, this lookup should be adding about 0.5 seconds
 to 10,000 records,

why?  what are you doing to 10,000 records?

 far from the actual 38 seconds it is adding.  I've
 tried to change up the query in the trigger to see if I could get
 different results with not much luck.  Any idea what might be taking up
 the extra time or what I can do to troubleshoot?

Try it without that check (on sample data) and see how much faster it goes
(just comment out that part of the trigger).
When I tested that here I could not see a definite difference
(timings were all over the place, some slower, some faster).

how much disk (in bytes, and dollars) are you hoping to save by not
using the index?

-- 
⚂⚃ 100% natural



Re: [GENERAL] Trigger Performance

2011-01-15 Thread Randall Smith
Before reading.  This is solved.  Was an error on my part.

On Sun, 2011-01-16 at 03:46 +, Jasen Betts wrote:
 In plpgsql IF is an implicit select. 
 http://www.postgresql.org/docs/8.4/interactive/plpgsql-expressions.html
 
  IF EXISTS (SELECT 1 FROM t1 WHERE
  volume_id = NEW.volume_id AND name = NEW.name) THEN
  RAISE EXCEPTION '% already exists on volume', NEW.name;
  END IF;

Thanks.  Certainly more concise.

 
  0.053 ms/record / 1,000 ms/sec * 10,000 records = .53 seconds
 
 huh?
 
  According to that stat, this lookup should be adding about 0.5 seconds
  to 10,000 records,
 
 why?  what are you doing to 10,000 records?


Inserting them.  Sorry, that was a critical omission on my part.  The
trigger check is for inserts and I'm testing its performance by
inserting 10,000 records.

Turns out my EXPLAIN ANALYZE times were so low because I was querying
for a volume_id that had only a few rows.  When I query for a volume_id
with 10,000 rows, it changes to 7 ms, which matches the performance I'm
seeing.  That's acceptable to me because that's probably at the upper
end of what I'll see.  7 ms to check 10,000 text fields is actually
impressive to me.


 
 how much disk (in bytes, and dollars) are you hoping to save by not
 using the index?
 

I need to be able to fit my indexes in RAM.  This table will have a few
billion records and I have several other indexes with billions of
records, and I'd like my DB to run well on a machine with 20G (preferred)
or 60G (max) of RAM and not have to resort to sharding.  These text fields can
be up to 1k each.  A 1 billion row int8 index comes in around 2G.
Adding the text field to the index would probably put it at over 20G per
billion records.

Thanks.

-Randall

