[PERFORM] Feature suggestion : FAST CLUSTER

2007-05-22 Thread PFC



	Well, CLUSTER is so slow (and it doesn't cluster the toast tables  
associated with the table to be clustered).

However, when people use CLUSTER they use it to speed up their queries.
For that the table does not need to be perfectly in-order.

So, here is a new idea for CLUSTER:

- choose a chunk size (about 50% of your RAM)
- set up disk sorts for all indexes
- seq scan the table:
  - take a chunk of chunk_size
  - sort it (in memory)
  - write it into the new table file
  - while we have the data on hand, also send the indexed columns' data
    into the corresponding disk sorts
- finish the index disk sorts and rebuild indexes

	This does not run a complete sort on the table, so it would run at about your seq scan disk throughput. Obviously the end result is not as good as a real CLUSTER, since the table will be made up of several ordered chunks; a range lookup on the clustered columns would therefore need at most N seeks (one per chunk), versus 1 for a truly clustered table. But it only scans the table once and writes it once, even counting the index rebuild.
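
	For illustration, a rough way to approximate the idea by hand today is to copy the table in sorted chunks; the table, column and chunk sizes below are made up, and in practice you would bound chunks by a key or ctid range rather than OFFSET:

    -- sketch only: build a "mostly clustered" copy of big_table on cluster_col
    CREATE TABLE big_table_new (LIKE big_table);
    INSERT INTO big_table_new
        SELECT * FROM (SELECT * FROM big_table LIMIT 1000000 OFFSET 0) AS chunk
        ORDER BY cluster_col;
    INSERT INTO big_table_new
        SELECT * FROM (SELECT * FROM big_table LIMIT 1000000 OFFSET 1000000) AS chunk
        ORDER BY cluster_col;
    -- ...one INSERT per chunk, then recreate the indexes and swap the table names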


	I would think that, with this approach, if people can CLUSTER a large table in 5 minutes instead of hours, they will actually use it. So even if the resulting table is not as optimal as a fully clustered one, it will still be much better than the non-clustered case.







Re: [PERFORM] Postgres Benchmark Results

2007-05-22 Thread Gregory Stark
"Alvaro Herrera" <[EMAIL PROTECTED]> writes:

> Scott Marlowe wrote:
>> 
>> I thought you were limited to 250 or so COMMITS to disk per second, and 
>> since >1 client can be committed at once, you could do greater than 250 
>> tps, as long as you had >1 client providing input.  Or was I wrong?
>
> My impression is that you are correct in theory -- this is the "commit
> delay" feature.  But it seems that the feature does not work as well as
> one would like; and furthermore, it is disabled by default.

Even without commit delay, a client will commit any pending WAL records when it
syncs the WAL. The clients waiting to commit their records will find them
already synced when they get woken up.

However, as mentioned a while back, in practice it doesn't work quite right and
you should expect about half the theoretical performance. So even with 10 clients
you should expect to see 5*120 tps on a 7200 rpm drive and 5*250 tps on a
15k rpm drive.
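
(For reference on those numbers: one commit per platter rotation gives 7200/60 = 120 syncs/s and 15000/60 = 250 syncs/s; with N clients piggy-backing on each sync, ideal group commit would approach N*120 or N*250 tps, and halving that for the practical shortfall gives the 5*120 and 5*250 figures quoted above for 10 clients.)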

Heikki posted a patch that experimented with fixing this. Hopefully it'll be
fixed for 8.4.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




Re: [PERFORM] Postgres Benchmark Results

2007-05-22 Thread Gregory Stark

What's interesting here is that on a couple metrics the green curve is
actually *better* until it takes that nosedive at 500 MB. Obviously it's not
better on average hits/s, the most obvious metric. But on deviation and
worst-case hits/s it's actually doing better.

Note that while the average hits/s between 100 and 500 is over 600 tps for
Postgres there is a consistent smattering of plot points spread all the way
down to 200 tps, well below the 400-500 tps that MySQL is getting.

Some of those are undoubtedly caused by things like checkpoints and vacuum
runs. Hopefully the improvements that are already in the pipeline will reduce
them.

I mention this only to try to move some of the focus from the average
performance to trying to remove the pitfalls that affect 1-10% of transactions
and screw the worst-case performance. In practical terms it's the worst case
that governs perceptions, not the average case.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com




[PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
I found several posts about INSERT/UPDATE performance in this group,
but they were not really what I am searching for an answer to...

I have a simple reference table WORD_COUNTS that contains the counts of
the words that appear in word arrays stored in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
)
WITHOUT OIDS;

I have some PL/pgSQL code in a stored procedure like

  FOR r IN select id, array_of_words from word_storage LOOP
    begin
      -- insert the missing words
      insert into WORD_COUNTS ( word, count )
      ( select word, 0
          from ( select distinct (r.array_of_words)[s.index] as d_word
                   from generate_series( 1, array_upper( r.array_of_words, 1 ) ) as s(index)
               ) as distinct_words
         where word not in ( select d_word from WORD_COUNTS ) );
      -- update the counts
      update WORD_COUNTS
         set count = COALESCE( count, 0 ) + 1
       where word in ( select distinct (r.array_of_words)[s.index] as word
                         from generate_series( 1, array_upper( r.array_of_words, 1 ) ) as s(index) );
    exception when others then
      error_count := error_count + 1;
    end;
    record_count := record_count + 1;
  END LOOP;

This code runs extremely slowly. It takes about 10 minutes to process
1 records, and the word storage has more than 2 million records to
be processed.

Does anybody have know-how about populating such reference tables,
and what can be optimized in this situation?

Maybe the generate_series() call used to unnest the array is where
I lose the performance?

Are set-based updates/inserts more efficient than single inserts/updates
run in smaller loops?

Thanks for your help,

Valentine Gogichashvili




Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread Peter Childs

On 22 May 2007 01:23:03 -0700, valgog <[EMAIL PROTECTED]> wrote:


I found several post about INSERT/UPDATE performance in this group,
but actually it was not really what I am searching an answer for...

I have a simple reference table WORD_COUNTS that contains the count of
words that appear in a word array storage in another table.

CREATE TABLE WORD_COUNTS
(
  word text NOT NULL,
  count integer,
  CONSTRAINT PK_WORD_COUNTS PRIMARY KEY (word)
)
WITHOUT OIDS;




Is there any reason why count is not NOT NULL? (That would simplify your
code by removing the COALESCE.)

INSERT is more efficient than UPDATE, because in PostgreSQL an UPDATE is
effectively a delete followed by an insert.

Oh, and GROUP BY is nearly always quicker than DISTINCT and can (always?) be
rewritten as such. I'm not 100% sure why it's different, but it is.
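
For example, these two forms should be equivalent (the table name is just a placeholder):

    SELECT DISTINCT word FROM some_table;
    SELECT word FROM some_table GROUP BY word;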

Peter.






Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread Richard Huxton

valgog wrote:

I found several post about INSERT/UPDATE performance in this group,
but actually it was not really what I am searching an answer for...

I have a simple reference table WORD_COUNTS that contains the count of
words that appear in a word array storage in another table.


I think this is the root of your problem, I'm afraid. You're trying to 
count individual words when you're storing an array of words. I don't 
think any of the Gist/GIN indexes will help you with this either.


However, since "you don't want to start from here" isn't very useful 
here and now:


1. See what the performance (explain analyse) of the "select 
distinct...generate_series()" statement is. I think you're right and 
it's going to be slow.
2. You're looping through each row of word_storage and counting 
separately. Write it as one query if possible.
3. As Peter says, don't insert then update, start with an empty table 
and just insert totals for the lot (see #2); a sketch follows below.
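
A sketch of what #2 and #3 could look like as a single statement, reusing the 
column names from the earlier posts (untested; the 1000 is just an arbitrary 
cap on array length):

    INSERT INTO word_counts (word, count)
    SELECT (ws.array_of_words)[s.i] AS word, count(*)
      FROM word_storage ws
     CROSS JOIN generate_series(1, 1000) AS s(i)   -- assumes no array is longer than 1000
     WHERE s.i <= array_upper(ws.array_of_words, 1)
     GROUP BY 1;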


I'd probably write the query in plperl/python or something else that 
supports hash/dictionary structures. Then just process the whole 
word_storage into the hash - assuming you only have a few thousand 
distinct words that shouldn't take up too much memory.


--
  Richard Huxton
  Archonet Ltd



[PERFORM] is file size relevant in choosing index or table scan?

2007-05-22 Thread Joost Kraaijeveld
Hi,

I have a table with a file size of 400 MB and an index of 100 MB. Does 
PostgreSQL take the file sizes of both the table and the index into account 
when determining whether it should do a table or an index scan? 

TIA

Joost



Re: [PERFORM] is file size relevant in choosing index or table scan?

2007-05-22 Thread Richard Huxton

Joost Kraaijeveld wrote:

Hi,

I have a table with a file size of 400 MB and an index of 100 MB.
Does PostgreSQL take the file sizes of both the table and the index
into account when determining whether it should do a table or an index scan?


In effect yes, although it will think in terms of row sizes and disk 
blocks. It also considers how many rows it thinks it will fetch and 
whether the rows it wants are together or spread amongst many blocks. It 
also tries to estimate what the chances are of those blocks being cached 
in RAM vs still on disk.


So: 1 row from a 4 million row table, accessed by primary key => index.
20 rows from a 200 row table => seq scan (probably).
In between => depends on your postgresql.conf
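
If in doubt, EXPLAIN ANALYZE shows which choice the planner actually makes; 
for example (table and column names are placeholders):

    EXPLAIN ANALYZE SELECT * FROM orders WHERE order_id = 42;      -- selective: expect an index scan
    EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'open';    -- unselective: expect a seq scan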

--
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
I have rewritten the code like

  existing_words_array := ARRAY( select word
                                   from WORD_COUNTS
                                  where word = ANY ( array_of_words ) );

  not_existing_words_array := ARRAY( select distinct_word
                                       from ( select distinct (array_of_words)[s.index] as distinct_word
                                                from generate_series( 1, array_upper( array_of_words, 1 ) ) as s(index)
                                            ) as distinct_words
                                      where distinct_word <> ALL ( existing_words_array ) );

  -- insert the missing words
  if not_existing_words_array is not null then
    insert into WORD_COUNTS ( word, count )
    ( select word, 1
        from ( select not_existing_words_array[s.index] as word
                 from generate_series( 1, array_upper( not_existing_words_array, 1 ) ) as s(index)
             ) as distinct_words );
  end if;

  -- update the counts
  if existing_words_array is not null then
    update WORD_COUNTS
       set count = COALESCE( count, 0 ) + 1
     where sw_word = ANY ( existing_words_array );
  end if;


Now it processes a million records in 14 seconds... so the problem was
probably that the NOT IN WORD_COUNTS lookup was way too expensive.



Re: [PERFORM] Postgres Benchmark Results

2007-05-22 Thread PFC
Note that while the average hits/s between 100 and 500 is over 600 tps for
Postgres there is a consistent smattering of plot points spread all the way
down to 200 tps, well below the 400-500 tps that MySQL is getting.


	Yes, these are mostly due to checkpointing.
	Also, note that a real forum would not insert 100 posts/s, so it would not feel this effect. But in order to finish the benchmark in a reasonable amount of time, we have to push on the inserts.


Some of those are undoubtedly caused by things like checkpoints and vacuum
runs. Hopefully the improvements that are already in the pipeline will reduce
them.


	I am re-running it with other tuning, notably cost-based vacuum delay and less frequent checkpoints, and it is a *lot* smoother.
	These take a full night to run, so I'll post more results when I have useful stuff to show.

This has proven to be a very interesting trip to benchmarkland...
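
For reference, the knobs in question live in postgresql.conf; an 8.2-era sketch 
(the values are purely illustrative, not the ones used in this benchmark):

    checkpoint_segments = 64     # more WAL segments => less frequent checkpoints
    checkpoint_timeout = 1800    # seconds between forced checkpoints
    vacuum_cost_delay = 20       # milliseconds; enables cost-based vacuum delay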



Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread PFC

On Tue, 22 May 2007 10:23:03 +0200, valgog <[EMAIL PROTECTED]> wrote:


I found several post about INSERT/UPDATE performance in this group,
but actually it was not really what I am searching an answer for...

I have a simple reference table WORD_COUNTS that contains the count of
words that appear in a word array storage in another table.


Mmm.

If I were you, I would :

	- Create a procedure that flattens all the arrays and returns all the  
words :


PROCEDURE flatten_arrays RETURNS SETOF TEXT
FOR word_array IN SELECT word_array FROM your_table LOOP
FOR i IN 1...array_upper( word_array ) LOOP
RETURN NEXT tolower( word_array[ i ] )

So, SELECT * FROM flatten_arrays() returns all the words in all the arrays.
To get the counts quickly I'd do this :

SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word

You can then populate your counts table very easily and quickly, since  
it's just a seq scan and hash aggregate. One second for 10.000 rows would  
be slow.
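
A runnable version of that sketch, assuming the word_storage.array_of_words 
column from the earlier posts (untested, illustration only):

    CREATE OR REPLACE FUNCTION flatten_arrays() RETURNS SETOF text AS $$
    DECLARE
        words text[];
    BEGIN
        FOR words IN SELECT array_of_words FROM word_storage LOOP
            FOR i IN 1 .. coalesce(array_upper(words, 1), 0) LOOP
                RETURN NEXT lower(words[i]);   -- lower(), not tolower()
            END LOOP;
        END LOOP;
        RETURN;
    END;
    $$ LANGUAGE plpgsql;

    -- populate the counts table in one pass:
    INSERT INTO word_counts (word, count)
    SELECT word, count(*) FROM flatten_arrays() AS f(word) GROUP BY word;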




Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
On May 22, 12:14 pm, [EMAIL PROTECTED] (PFC) wrote:
> On Tue, 22 May 2007 10:23:03 +0200, valgog <[EMAIL PROTECTED]> wrote:
> > I found several post about INSERT/UPDATE performance in this group,
> > but actually it was not really what I am searching an answer for...
>
> > I have a simple reference table WORD_COUNTS that contains the count of
> > words that appear in a word array storage in another table.
>
> Mmm.
>
> If I were you, I would :
>
> - Create a procedure that flattens all the arrays and returns all the 
>  
> words :
>
> PROCEDURE flatten_arrays RETURNS SETOF TEXT
> FOR word_array IN SELECT word_array FROM your_table LOOP
> FOR i IN 1...array_upper( word_array ) LOOP
> RETURN NEXT tolower( word_array[ i ] )
>
> So, SELECT * FROM flatten_arrays() returns all the words in all the arrays.
> To get the counts quickly I'd do this :
>
> SELECT word, count(*) FROM flatten_arrays() AS word GROUP BY word
>
> You can then populate your counts table very easily and quickly, since  
> it's just a seq scan and hash aggregate. One second for 10.000 rows would  
> be slow.

good idea indeed! will try this approach.




Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread valgog
On May 22, 12:00 pm, valgog <[EMAIL PROTECTED]> wrote:
> I have rewritten the code like
>
>   existing_words_array := ARRAY( select word
>from WORD_COUNTS
>   where word = ANY
> ( array_of_words ) );
>   not_existing_words_array := ARRAY( select distinct_word
>from ( select distinct
> (array_of_words)[s.index] as distinct_word
> from
> generate_series(1, array_upper( array_of_words, 1 ) ) as s(index)
>  ) as distinct_words
>   where distinct_word <> ALL
> ( existing_words_array ) );
>   -- insert the missing words
>   if not_existing_words_array is not null then
> insert into WORD_COUNTS
> ( word, count )
> ( select word, 1
> from ( select
> not_existing_words_array[s.index] as word
>  from generate_series( 1,
> array_upper( not_existing_words_array, 1 ) ) as s(index) ) as
> distinct_words
>  );
>   end if;
>   -- update the counts
>   if existing_words_array is not null then
> update WORD_COUNTS
>set count = COALESCE( count, 0 ) + 1
>  where sw_word = ANY ( existing_words_array );
>   end if;
>
> Now it processes a million records in 14 seconds... so it was probably
> the problem of looking up NOT IN WORD_COUNTS was way too expencive

Sorry... this code did not update anything at all, as I forgot about
the NULL values... I had to COALESCE practically everything and use
array_upper()... I do not have the performance numbers for the inserts
and updates yet...




[PERFORM] Performance comparison of indexes over timestamp fields

2007-05-22 Thread Arnau

Hi all,

  I have some tables where all the queries that will be executed are 
timestamp-driven, so it'd be nice to have an index over those fields.


  On older versions of PostgreSQL, at least in my experience, queries 
on timestamp fields were performing quite badly even when indexes existed; 
mainly sequential scans were performed.


  Now I have a newer version of PostgreSQL and I've done some tests 
comparing the performance of an index over a timestamp field with indexes 
over numeric fields. To do so, I have the following table:


 Table "public.payment_transactions"
Column |Type |Modifiers
+-+-
transaction_id | character varying(32)   | not null
timestamp_in   | timestamp without time zone | default now()
credits| integer |
epoch_in   | bigint  |
epoch_in2  | double precision|
Indexes:
   "pk_paytrans_transid" PRIMARY KEY, btree (transaction_id)
   "idx_paytrans_epochin" btree (epoch_in)
   "idx_paytrans_epochin2" btree (epoch_in2)
   "idx_paytrans_timestamp" btree (timestamp_in)

timestamp_in is the timestamp; epoch_in and epoch_in2 are the epoch 
equivalents of the timestamp, there to test how the indexes perform. We have 
three different test indexes: one over a timestamp field, one over an int8 
field and one over a double precision field.


While doing the tests this table has about 100,000 entries.



To test the diferent indexes I have executed the following:

Index over timestamp_in (timestamp)

# explain analyze select * from payment_transactions
  where timestamp_in between '2007-02-13'::timestamp and '2007-02-15'::timestamp;

                                  QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using idx_paytrans_timestamp on payment_transactions
   (cost=0.00..1480.24 rows=1698 width=138) (actual time=11.693..310.402 rows=1587 loops=1)
   Index Cond: ((timestamp_in >= '2007-02-13 00:00:00'::timestamp without time zone)
            AND (timestamp_in <= '2007-02-15 00:00:00'::timestamp without time zone))
 Total runtime: 318.328 ms
(3 rows)


Index over epoch_in (int8)

# explain analyze select * from payment_transactions
  where epoch_in between extract( epoch from '2007-02-13'::date )::int8
                     and extract( epoch from '2007-02-15'::date )::int8;

                                  QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using idx_paytrans_epochin on payment_transactions
   (cost=0.00..1483.24 rows=1698 width=138) (actual time=34.369..114.943 rows=1587 loops=1)
   Index Cond: ((epoch_in >= 1171321200::bigint) AND (epoch_in <= 1171494000::bigint))
 Total runtime: 120.804 ms
(3 rows)


Index over epoch_in2 (double precision)

# explain analyze select * from payment_transactions
  where epoch_in2 between extract( epoch from '2007-02-13'::date )
                      and extract( epoch from '2007-02-15'::date );

                                  QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using idx_paytrans_epochin2 on payment_transactions
   (cost=0.00..1479.24 rows=1698 width=138) (actual time=26.115..51.357 rows=1587 loops=1)
   Index Cond: ((epoch_in2 >= 1171321200::double precision) AND (epoch_in2 <= 1171494000::double precision))
 Total runtime: 57.065 ms
(3 rows)



As you can see, the time differences are very big:
  Timestamp:        318.328 ms
  int8 index:       120.804 ms
  double precision:  57.065 ms

Is this normal? Am I doing anything wrong?

As a rule of thumb, is it better to store epochs than timestamps?

Thank you very much
--
Arnau



Re: [PERFORM] Key/Value reference table generation: INSERT/UPDATE performance

2007-05-22 Thread cedric
Le mardi 22 mai 2007, Richard Huxton a écrit :
> valgog wrote:
> > I found several post about INSERT/UPDATE performance in this group,
> > but actually it was not really what I am searching an answer for...
> >
> > I have a simple reference table WORD_COUNTS that contains the count of
> > words that appear in a word array storage in another table.
>
> I think this is the root of your problem, I'm afraid. You're trying to
> count individual words when you're storing an array of words. I don't
> think any of the Gist/GIN indexes will help you with this either.
>
> However, since "you don't want to start from here" isn't very useful
> here and now:
>
> 1. See what the performance (explain analyse) of the "select
> distinct...generate_series()" statement is. I think you're right and
> it's going to be slow.
> 2. You're looping through each row of word_storage and counting
> separately. Write it as one query if possible.
> 3. As Peter says, don't insert then update, start with an empty table
> and just insert totals for the lot (see #2).
>
> I'd probably write the query in plperl/python or something else that
> supports hash/dictionary structures. Then just process the whole
> word_storage into the hash - assuming you only have a few thousand
> distinct words that shouldn't take up too much memory.
+1
I made something very similar, and using PL/pgSQL it is very slow, while using 
Perl it is very quick.

I have also used partitioning because of the cost of updates (copy the last 
partition to a new one while adding the new counts, so there are only inserts, 
and drop the old table if you want).



Re: [PERFORM] Performance comparison of indexes over timestamp fields

2007-05-22 Thread Alexander Staubo

On 5/22/07, Arnau <[EMAIL PROTECTED]> wrote:

   On older versions of PostgreSQL, at least in my experience, queries
on timestamp fields were performing quite badly even when indexes existed;
mainly sequential scans were performed.


PostgreSQL uses B-tree indexes for scalar values. For an expression
such as "t between a and b", I believe it's going to match both sides
of the table independently (i.e., t >= a and t <= b) and intersect
these subsets. This is inefficient.

You should get better performance by mapping timestamps to a
one-dimensional plane and indexing them using GiST. GiST implements an
R-tree-like structure that supports bounding-box searches.

This involves setting up a functional index:

 create index ... on payment_transactions using gist (
   box(point(extract(epoch from time), 0), point(extract(epoch from
time), 0)) box_ops)

I'm using box() here because GiST doesn't have a concept of points.

Then insert as usual, and then query with something like:

 select ... from payment_transactions
 where box(
   point(extract(epoch from '2006-04-01'::date), 0),
   point(extract(epoch from '2006-08-01'::date), 0)) && box(
   point(extract(epoch from time), 0),
   point(extract(epoch from time), 0));

PostgreSQL should be able to exploit the GiST index by recognizing
that the result of the box() expression is already computed in the
index.

This is much less convenient and portable -- I would love for
PostgreSQL to provide syntactic sugar and special-casing to make
this transparent -- but it is worth it if you are dealing with a lot of
range searches.


   Now I have a newer version of PostgreSQL and I've done some tests
comparing the performance of an index over a timestamp field with a
numeric field. To do so, I have the following table:

  Table "public.payment_transactions"
 Column |Type |Modifiers
+-+-
transaction_id | character varying(32)   | not null
timestamp_in   | timestamp without time zone | default now()
credits| integer |
epoch_in   | bigint  |
epoch_in2  | double precision|

[snip]

A timestamp is stored internally as an 8-byte double-precision float.
Therefore, timestamp_in and epoch_in2 should behave identically.
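
(Whether a given build really uses the float representation can be checked 
with the read-only setting below: 'off' means float timestamps, 'on' means 
64-bit integer timestamps.)

    SHOW integer_datetimes;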


While doing the tests this table has about 100.000 entries.


Make sure PostgreSQL is able to keep the entire table in memory by
setting shared_buffers; you don't want to be hitting the disk.

Make sure you run "analyze" on the table before you execute the test.


To test the diferent indexes I have executed the following:


Your query plans are roughly identical. The difference in the timings
implies that you only ran the queries once. I suggest you run each
query at least 10 times, and report the individual numbers (the "total
runtime" parts of the output) you get. Arithmetic means are not that
interesting.

Alexander.



Re: [PERFORM] Performance comparison of indexes over timestamp fields

2007-05-22 Thread Steinar H. Gunderson
On Tue, May 22, 2007 at 02:39:33PM +0200, Alexander Staubo wrote:
> PostgreSQL uses B-tree indexes for scalar values. For an expression
> such as "t between a and b", I believe it's going to match both sides
> of the table independently (ie., t >= a and t <= b) and intersect
> these subsets. This is inefficient.

A B-tree index can satisfy range queries such as this.

> You should get better performance by mapping timestamps to a
> one-dimensional plane and indexing them using GiST. GiST implements an
> R-tree-like structure that supports bounding-box searches.

You may be thinking of interval overlaps?

/* Steinar */
-- 
Homepage: http://www.sesse.net/



Re: [PERFORM] Performance comparison of indexes over timestamp fields

2007-05-22 Thread Alexander Staubo

On 5/22/07, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote:

On Tue, May 22, 2007 at 02:39:33PM +0200, Alexander Staubo wrote:
> PostgreSQL uses B-tree indexes for scalar values. For an expression
> such as "t between a and b", I believe it's going to match both sides
> of the table independently (ie., t >= a and t <= b) and intersect
> these subsets. This is inefficient.

A B-tree index can satisfy range queries such as this.


You're right, and I'm wrong -- my head is not in the right place
today. B-trees are inefficient for intervals, but quite satisfactory
for range searches.

Alexander.



[PERFORM] Tips & Tricks for validating hardware/os

2007-05-22 Thread Stephane Bailliez

Hi,

Out of curiosity, can anyone share his tips & tricks for validating a 
machine before labelling it as 'ready to use postgres - you probably 
won't trash my data today'?
I'm looking for a way to stress-test components, especially kernel/disk, 
to have confidence > 0 that I can use postgres on top of it.


Any secret trick is welcome (beside the memtest one :)

Thanks !

-- stephane



Re: [PERFORM] Tips & Tricks for validating hardware/os

2007-05-22 Thread Alexander Staubo

On 5/22/07, Stephane Bailliez <[EMAIL PROTECTED]> wrote:

Out of curiosity, can anyone share his tips & tricks to validate a
machine before labelling it as 'ready to use postgres - you probably
won't trash my data today' ?
I'm looking for a way to stress test components especially kernel/disk
to have confidence > 0 that I can use postgres on top of it.

Any secret trick is welcome (beside the memtest one :)


Compile the Linux kernel -- it's a pretty decent stress test.

You could run pgbench, which comes with PostgreSQL (as part of the
contrib package). Give it a database size that's larger than the amount
of physical memory in the box.

Alexander.



Re: [PERFORM] Tips & Tricks for validating hardware/os

2007-05-22 Thread PFC



Out of curiosity, can anyone share his tips & tricks to validate a
machine before labelling it as 'ready to use postgres - you probably
won't trash my data today' ?
I'm looking for a way to stress test components especially kernel/disk
to have confidence > 0 that I can use postgres on top of it.


	That would be running a filesystem benchmark, pulling the plug, then  
counting the dead.


http://sr5tech.com/write_back_cache_experiments.htm



Re: [PERFORM] Performance comparison of indexes over timestamp fields

2007-05-22 Thread Tom Lane
Arnau <[EMAIL PROTECTED]> writes:
> As you can see the time difference are very big
>Timestamp:318.328 ms
>int8 index:   120.804 ms
>double precision: 57.065 ms

As already suggested elsewhere, you probably weren't sufficiently
careful in taking your measurements.

A look at the code says that int8 comparison ought to be the fastest
of these.  If timestamps are implemented as floats (which you didn't
say) their comparison speed ought to be *exactly* the same as floats,
because the comparison functions are line-for-line the same.  If
timestamps are implemented as int8 then they should be similar to
int8 comparisons, maybe a tad slower due to an extra level of function
call.  But in any case it seems likely that the actual comparison
function calls would be just a small fraction of the runtime.

regards, tom lane



[PERFORM] Domains versus Check Constraints

2007-05-22 Thread Chander Ganesan
Are there any performance improvements that come from using a domain 
over a check constraint (aside from the ease of management component)?
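
For concreteness, the two forms being compared are along these lines (names 
are illustrative):

    CREATE DOMAIN us_zip AS text CHECK (length(VALUE) = 5);
    CREATE TABLE customers_a (zip us_zip);                         -- domain
    CREATE TABLE customers_b (zip text CHECK (length(zip) = 5));   -- inline check constraint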


thanks

--
Chander Ganesan
Open Technology Group, Inc.
One Copley Parkway, Suite 210
Morrisville, NC  27560
Phone: 877-258-8987/919-463-0999




[PERFORM] Drop table vs Delete record

2007-05-22 Thread Orhan Aglagul

My application has two threads: one inserts thousands of records per second 
into a table (t1), and the other periodically deletes expired records (also in 
thousands) from the same table. So, we have one thread adding rows while the 
other thread is trying to delete rows. In a short time the overall performance 
of any SQL statement on that instance degrades (e.g. SELECT count(*) FROM t1 
takes more than a few seconds with less than 10K rows).

My question is: would any SQL statement perform better if I renamed the table 
to t1_%indx periodically, created a new table t1 (for new inserts) and just 
dropped the tables with expired records, rather than deleting records? 
(t1 is a simple table with many rows and no constraints.) 

(I know I could run vacuum analyze) 

Thanks,

Orhan A.   


Re: [PERFORM] Drop table vs Delete record

2007-05-22 Thread Andreas Kostyrka
Consider table partitioning (it's described in the manual).

Andreas
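
A minimal sketch of what that looks like with 8.2-style inheritance 
partitioning (table and column names are made up for illustration):

    CREATE TABLE t1 (id bigint, payload text, created timestamp NOT NULL);
    CREATE TABLE t1_20070522
        (CHECK (created >= '2007-05-22' AND created < '2007-05-23'))
        INHERITS (t1);
    -- new rows go into the current child table; expiring a day's worth is then just:
    DROP TABLE t1_20070515;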





Re: [PERFORM] Drop table vs Delete record

2007-05-22 Thread Orhan Aglagul
Checking out right now
Thanks for the fast response.



Re: [PERFORM] Postgres Benchmark Results

2007-05-22 Thread Greg Smith

On Tue, 22 May 2007, Gregory Stark wrote:


However as mentioned a while back in practice it doesn't work quite right and
you should expect to get 1/2 the expected performance. So even with 10 clients
you should expect to see 5*120 tps on a 7200 rpm drive and 5*250 tps on a
15k rpm drive.


I would agree that's the approximate size of the upper bound.  There are 
so many factors that go into the effectiveness of commit_delay that I 
wouldn't word it so strongly as to say you can "expect" that much benefit. 
The exact delay amount (which can be hard to set if your client load 
varies greatly), the size of the transactions, the balance of seek-bound reads 
vs. memory-based ones in the transactions, serialization in the transaction 
stream, and so many other things can reduce the effective benefit.


Also, there are generally other performance issues in the types of systems 
you would think would get the most benefit from this parameter that end up 
slowing things down anyway.  I've been seeing a best case of closer to 
2*single tps rather than 5* on my single-drive systems with no write 
caching, but I'll admit I haven't done an exhausting look at it yet (too 
busy with the real systems that have good controllers).  One of these 
days...


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] Tips & Tricks for validating hardware/os

2007-05-22 Thread Andreas Kostyrka
You forgot pulling some RAID drives at random times to see how the hardware 
deals with that, and how it deals with the rebuild afterwards. (Many RAID 
solutions leave you with the worst of both worlds, taking longer to rebuild than a 
restore from backup would take, while at the same time providing disk I/O 
performance so bad that the server becomes useless during the rebuild.)

Andreas





Re: [PERFORM] Tips & Tricks for validating hardware/os

2007-05-22 Thread Greg Smith

On Tue, 22 May 2007, Stephane Bailliez wrote:

Out of curiosity, can anyone share his tips & tricks to validate a machine 
before labelling it as 'ready to use postgres - you probably won't trash my 
data today' ?


Write a little script that runs pgbench in a loop forever.  Set your 
shared_buffer cache to use at least 50% of the memory in the machine, and 
adjust the database size and concurrent clients so it's generating a 
substantial amount of disk I/O and using a fair amount of the CPU.


Install the script so that it executes on system startup, for example by adding 
it to rc.local. Put the machine close to your desk.  Every time you walk by 
it, kill the power and then start it back up.  This will give you a mix of 
long overnight runs with no interruption to stress the overall system, 
with a nice dose of recovery trauma.  Skim the Postgres and OS log files 
every day.  Do that for a week; if it's still running, your data should be 
safe under real conditions.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
