[PERFORM] performance tuning queries

2008-11-26 Thread Kevin Kempter
Hi All;

I'm looking for tips/ideas for performance tuning some specific queries.
These generally involve large tables on a highly active OLTP system
(100,000 - 200,000+ queries per day)

First off, any thoughts on tuning inserts into large tables? I have a large
table with an insert like this:

insert into public.bigtab1 (text_col1, text_col2, id) values ...

                QUERY PLAN
-------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

The query cost is low, but this is one of the slowest statements according to pgfouine.







Next we have a select count(*) that is also one of the top offenders:

select count(*) from public.tab3  where user_id=31 
and state='A' 
and amount>0;

                                      QUERY PLAN
----------------------------------------------------------------------------------------
 Aggregate  (cost=3836.53..3836.54 rows=1 width=0)
   ->  Index Scan using order_user_indx on tab3  (cost=0.00..3834.29 rows=897 width=0)
         Index Cond: (idx_user_id = 31406948::numeric)
         Filter: ((state = 'A'::bpchar) AND (amount > 0::numeric))
(4 rows)

We have an index on user_id but not on state or amount.

Should we add an index on amount?
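
For example, one option I'm considering is a composite index covering all three
columns (the DDL below is only a sketch, using the column names from the query):

create index tab3_user_state_amount_idx on public.tab3 (user_id, state, amount);

Would something like that be a better bet than an index on amount alone?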



Thoughts ?









Re: [PERFORM] Partition table query performance

2008-11-26 Thread Gregory Stark
"Greg Jaman" <[EMAIL PROTECTED]> writes:

> I have a problem with partitioning and I'm wondering if anyone can provide
> some insight.   I'm trying to find the max value of a column across multiple
> partitions.  The query against the partition set is quite slow while queries
> against child partitions is very fast!

I'm afraid this is a known problematic use case of Postgres's current
partitioning support. Postgres is not capable of finding the plan you're
undoubtedly looking for, namely one that iterates over the partitions using
the same plan as your child-table query.

There are several groups working to improve this in different ways but none of
them appear to be on track to be in 8.4 so it will be 8.5 or later before they
appear. Sorry.
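
In the meantime the usual workaround is to spell the per-partition aggregate
out by hand, something along these lines (child table and column names here
are invented for illustration):

  select max(m)
    from (select max(ts) as m from parent_2008_10
          union all
          select max(ts) as m from parent_2008_11) s;

Each branch can then use the index on its own child table.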

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!



Re: [PERFORM] Memory Allocation

2008-11-26 Thread Scott Carey
Swappiness optimization is going to vary.   Definitely test on your own.

For a bulk load database with a large page cache, swappiness = 60 (the default) is
_GUARANTEED_ to force the OS to swap out some of Postgres while in heavy use.
This is heavily dependent on the page cache size, work_mem size, and
concurrency.
I've had significantly increased performance setting this value low (1000x! --
if your DB starts swapping out postgres, your performance is dead).  The default
has the OS targeting close to 60% of the memory for page cache.  On a 32GB server,
with a 7GB postgres buffer cache, several concurrent queries reading GBs of data
and using 500MB+ work_mem (huge aggregates), the default swappiness will
choose to page out postgres with about 19GB of disk page cache left to evict,
with disastrous results.  And that is a read-only test.  Tests with writes can
trigger it earlier if combined with bad dirty_buffers settings.

The root of the problem is that the Linux paging algorithm estimates that I/O 
for file read access is as costly as I/O for paging.  A reasonable assumption 
for a desktop, a ridiculously false assumption for a large database with high 
capacity DB file I/O and a much lower capability swap file.  Not only that -- 
page in is almost always near pure random reads, but DB I/O is often 
sequential.  So losing 100MB of cached db file takes a lot less time to read
back in than 100MB of swapped-out application memory.

If you do have enough other idle applications taking up RAM that should be
pushed out to disk from time to time (perhaps the programs doing your bulk
loading?), a higher value is useful.  Although it is not exact, think of the
swappiness value (very roughly) as the percentage of RAM that the OS would
prefer to devote to page cache rather than to applications.

The more RAM you have and the larger your postgres memory usage, the lower the
swappiness value should be.  60% of 24GB is ~14.5GB; if you really have that
much stuff in RAM that should be paged out to save space, try it.

I currently use a value of 1, on a 32GB machine, and about 600MB of 'stuff' 
gets paged out normally, 1400MB under heavy load.  This is a dedicated machine. 
 Higher values page out more stuff, which increases the cache size and helps
performance a little, but under heavy load it hits the paging wall and
falls over.  The small improvement in performance when the system is not
completely stressed is not worth the risk of hitting that wall, for me.
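
Setting it is a one-liner, e.g.:

  sysctl -w vm.swappiness=1

(or echo 1 > /proc/sys/vm/swappiness; putting vm.swappiness = 1 in
/etc/sysctl.conf makes it persist across reboots).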

***For a bulk load database, one is optimizing for _writes_ and extra page 
cache doesn't help writes like it does reads.***

When I use a machine with misc. other lower priority apps and less RAM, I have 
found larger values to be helpful.

If your DB is configured with a low shared_buffers and small work_mem, you 
probably want the OS to use that much memory for disk pages, and again a higher 
swappiness may be more optimal.

Like all of these settings, tune to your application and test.  Many of these 
settings are things that go hand in hand with others, but alone don't make as 
much sense.  Tuning Postgres to do most of the caching and making the OS get 
out of the way is far different than tuning the OS to do as much caching work 
as possible and minimizing postgres.  Which of those two strategies is best is 
highly application dependent, somewhat O/S dependent, and also hardware 
dependent.

-Original Message-
From: Kevin Grittner [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2008 3:09 PM
To: Ryan Hansen; pgsql-performance@postgresql.org; Scott Carey
Subject: Re: [PERFORM] Memory Allocation

>>> Scott Carey <[EMAIL PROTECTED]> wrote:
> Set swappiness to 0 or 1.

We recently converted all 72 remote county databases from 8.2.5 to
8.3.4.  In preparation we ran a test conversion of a large county over
and over with different settings to see what got us the best
performance.  Setting swappiness below the default degraded
performance for us in those tests for identical data, same hardware,
no other changes.

Our best guess is that code which really wasn't getting called got
swapped out leaving more space in the OS cache, but that's just a
guess.  Of course, I'm sure people would not be recommending it if
they hadn't done their own benchmarks to confirm that this setting
actually improved things in their environments, so the lesson here is
to test for your environment when possible.

-Kevin



Re: [PERFORM] Memory Allocation

2008-11-26 Thread Kevin Grittner
>>> Scott Carey <[EMAIL PROTECTED]> wrote: 
> Set swappiness to 0 or 1.
 
We recently converted all 72 remote county databases from 8.2.5 to
8.3.4.  In preparation we ran a test conversion of a large county over
and over with different settings to see what got us the best
performance.  Setting swappiness below the default degraded
performance for us in those tests for identical data, same hardware,
no other changes.
 
Our best guess is that code which really wasn't getting called got
swapped out leaving more space in the OS cache, but that's just a
guess.  Of course, I'm sure people would not be recommending it if
they hadn't done their own benchmarks to confirm that this setting
actually improved things in their environments, so the lesson here is
to test for your environment when possible.
 
-Kevin



Re: [PERFORM] Memory Allocation

2008-11-26 Thread Scott Carey
Tuning for bulk loading:

Make sure the Linux kernel parameters in /proc/sys/vm related to the page cache
are set well.
Set swappiness to 0 or 1.
Make sure you understand and configure /proc/sys/vm/dirty_background_ratio
and /proc/sys/vm/dirty_ratio well.
With enough RAM the default on some kernel versions is way, way off (40% of RAM 
with dirty pages!  yuck).
http://www.westnet.com/~gsmith/content/linux-pdflush.htm
If postgres is doing a lot of caching for you, you probably want dirty_ratio at
10% or less, and you'll want the OS to start flushing to disk sooner rather
than later.  A dirty_background_ratio of 3% with 24GB of RAM is 720MB -- a
pretty big buffer.  I would not personally want this buffer to be larger than 5
seconds of the max write speed of the disk I/O.
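
Concretely, that might look something like this in /etc/sysctl.conf (the values
are only a starting point for a 24GB box -- test against your own workload):

  vm.dirty_background_ratio = 3
  vm.dirty_ratio = 10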

You'll need to tune your background writer to be aggressive enough to actually 
write data fast enough so that checkpoints don't suck, and tune your checkpoint 
size and settings as well.  Turn on checkpoint logging on the database and run 
tests while looking at the output of those.  Ideally, most of your batch writes 
have made it to the OS before the checkpoint, and the OS has actually started 
moving most of it to disk.  If your settings are wrong,  you'll have the data 
buffered twice, and most or nearly all of it will be in memory when the 
checkpoint happens, and the checkpoint will take a LONG time.  The default 
Linux settings + default postgres settings + large shared_buffers will almost 
guarantee this situation for bulk loads.  Both have to be configured with 
complementary settings.  If you have a large postgres buffer, the OS buffer 
should be small and write more aggressively.  If you have a small postgres 
buffer, the OS can be more lazy and cache much more.
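
For reference, the postgres-side knobs involved are roughly these (8.3 names;
the values are illustrative starting points, not recommendations):

  log_checkpoints = on
  checkpoint_segments = 32
  checkpoint_completion_target = 0.8
  bgwriter_lru_maxpages = 500

Watch the checkpoint log lines during a load and adjust from there.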


From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ryan Hansen
Sent: Wednesday, November 26, 2008 2:10 PM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Memory Allocation

Hey all,

This may be more of a Linux question than a PG question, but I'm wondering if 
any of you have successfully allocated more than 8 GB of memory to PG before.

I have a fairly robust server running Ubuntu Hardy Heron, 24 GB of memory, and 
I've tried to commit half the memory to PG's shared buffer, but it seems to 
fail.  I'm setting the kernel shared memory accordingly using sysctl, which 
seems to work fine, but when I set the shared buffer in PG and restart the 
service, it fails if it's above about 8 GB.  I actually have it currently set 
at 6 GB.

I don't have the exact failure message handy, but I can certainly get it if 
that helps.  Mostly I'm just looking to know if there's any general reason why 
it would fail, some inherent kernel or db limitation that I'm unaware of.

If it matters, this DB is going to be hosting and processing hundreds of GB and 
eventually TB of data, it's a heavy read-write system, not transactional 
processing, just a lot of data file parsing (python/bash) and bulk loading.  
Obviously the disks get hit pretty hard already, so I want to make the most of 
the large amount of available memory wherever possible.  So I'm trying to tune 
in that direction.

Any info is appreciated.

Thanks!


Re: [PERFORM] Memory Allocation

2008-11-26 Thread Tom Lane
"Ryan Hansen" <[EMAIL PROTECTED]> writes:
> I have a fairly robust server running Ubuntu Hardy Heron, 24 GB of memory,
> and I've tried to commit half the memory to PG's shared buffer, but it seems
> to fail.  I'm setting the kernel shared memory accordingly using sysctl,
> which seems to work fine, but when I set the shared buffer in PG and restart
> the service, it fails if it's above about 8 GB.

Fails how?  And what PG version is that?

FWIW, while there are various schools of thought on how large to make
shared_buffers, pretty much everybody agrees that half of physical RAM
is not the sweet spot.  What you're likely to get is maximal
inefficiency with every active disk page cached twice --- once in kernel
space and once in shared_buffers.

regards, tom lane



Re: [PERFORM] Memory Allocation

2008-11-26 Thread Carlos Moreno
Ryan Hansen wrote:
>
> Hey all,
>
> This may be more of a Linux question than a PG question, but I’m
> wondering if any of you have successfully allocated more than 8 GB of
> memory to PG before.
>
> I have a fairly robust server running Ubuntu Hardy Heron, 24 GB of
> memory, and I’ve tried to commit half the memory to PG’s shared
> buffer, but it seems to fail.
>

Though I'm not sure why this is happening or whether it is normal, I would
suggest that such a setting may be too high. From the Annotated
postgresql.conf document at

http://www.powerpostgresql.com/Downloads/annotated_conf_80.html,

the suggested range is 8 to 400MB. They specifically say that it
should never be set to more than 1/3 of the available memory, which
in your case is precisely the 8GB figure (I guess that's just a
coincidence --- I doubt that the server is written so that it
fails to start if shared_buffers is more than 1/3 of available RAM).

Another important parameter that you don't mention is the
effective_cache_size, which that same document suggests should
be about 2/3 of available memory. (this tells the planner the amount
of data that it can "probabilistically" expect to reside in memory due
to caching, and as such, the planner is likely to produce more
accurate estimates and thus better query optimizations).

Maybe you could set shared_buffers to, say, 1 or 2GB (that's already
beyond the recommended figure, but given that you have 24GB, it
may not hurt), and then effective_cache_size to 16GB or so?
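
In postgresql.conf that would be something like (figures only as a suggestion,
and assuming a version recent enough to accept memory units):

  shared_buffers = 2GB
  effective_cache_size = 16GB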

HTH,

Carlos
--




Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Scott Carey
> I used 1000 since doc wrote that max value is 1000
> Rid table contains 3.5millions rows, will increase 1 millions of rows per
> year and is updated frequently, mostly by adding.

> Is it OK to leave

> SET STATISTICS 1000;

> setting for this table this column or should  I try to decrease it ?

> Andrus.

If you expect millions of rows, and this is one of your most important use 
cases, leaving that column's statistics target at 1000 is probably fine.  You 
will incur a small cost on most queries that use this column (query planning is 
more expensive as it may have to scan all 1000 items for a match), but the risk 
of a bad query plan and a very slow query is a lot less.

It is probably worth the small constant cost to prevent bad plans in your
case, especially since the table will keep growing.  Larger tables generally
need larger statistics targets (longer most-common-value lists and more
histogram buckets).

Leave this at 1000, focus on your other issues first.  After all the other 
major issues are done you can come back and see if a smaller value is worth 
trying or not.

You may also end up setting higher statistics targets on some other columns to 
fix other issues.  You may want to set the value in the configuration file 
higher than the default of 10 -- I'd recommend starting with 40 and re-analyzing
the tables.  Going from 10 to 40 has a minor cost but can help the planner
create significantly better plans if you have skewed data distributions.
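
That is, in postgresql.conf:

  default_statistics_target = 40

followed by an ANALYZE of the tables (or the whole database) so the new target
actually takes effect.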

-Scott



Re: [PERFORM] Memory Allocation

2008-11-26 Thread Alan Hodgson
On Wednesday 26 November 2008, "Ryan Hansen" 
<[EMAIL PROTECTED]> wrote:
> This may be more of a Linux question than a PG question, but I'm
> wondering if any of you have successfully allocated more than 8 GB of
> memory to PG before.
>

CentOS 5, 24GB shared_buffers on one server here. No problems.

-- 
Alan



[PERFORM] Memory Allocation

2008-11-26 Thread Ryan Hansen
Hey all,

 

This may be more of a Linux question than a PG question, but I'm wondering
if any of you have successfully allocated more than 8 GB of memory to PG
before.

 

I have a fairly robust server running Ubuntu Hardy Heron, 24 GB of memory,
and I've tried to commit half the memory to PG's shared buffer, but it seems
to fail.  I'm setting the kernel shared memory accordingly using sysctl,
which seems to work fine, but when I set the shared buffer in PG and restart
the service, it fails if it's above about 8 GB.  I actually have it
currently set at 6 GB.

 

I don't have the exact failure message handy, but I can certainly get it if
that helps.  Mostly I'm just looking to know if there's any general reason
why it would fail, some inherent kernel or db limitation that I'm unaware
of.  

 

If it matters, this DB is going to be hosting and processing hundreds of GB
and eventually TB of data, it's a heavy read-write system, not transactional
processing, just a lot of data file parsing (python/bash) and bulk loading.
Obviously the disks get hit pretty hard already, so I want to make the most
of the large amount of available memory wherever possible.  So I'm trying to
tune in that direction.

 

Any info is appreciated.

 

Thanks!



Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
> Richard,
> 
>>> Results are provided in bottom of the message to which you replied.
>>
>> No - the explains there were contrasting a date test BETWEEN versus =.
> 
> I changed rid.toode statitics target to 100:
> 
> ALTER TABLE firma2.rid ALTER COLUMN toode SET STATISTICS 100;
> analyze firma2.rid;
> 
> Analyze takes 3 seconds and testcase rans fast.
> I'm planning to monitor results by looking log file for queries which
> take longer than 10 seconds.

Sensible. I don't know if 10 seconds is the right value for your
database, but there will be a point that filters out most of your
traffic but still gives enough to find problems.

> Do you still need results ?
> If yes, which query and how many times should I run?

If changing the statistics seems to help, you're not going to want to go
back just to repeat tests.

>> Ah, I think I understand. The test case was *missing* this clause.
> 
> I added  this clause to testcase. Also added char(70) colums containing
> padding characters to all three tables. Cannot still reproduce this issue
> in testcase in fast devel 8.3 notebook.
> In testcase order_products contains product_id values in a very regular
> order, maybe this affects the results. No idea how to use random() to
> generate random
> products for every order.

Ideally you don't even want random products. You want a distribution of
products that matches the same "shape" as you have in your production
database.
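
(Untested, but one way to get a skewed rather than perfectly regular
distribution is to reuse the formula from your own script and just bias the
product choice, e.g. replacing the orders_products INSERT with:

INSERT INTO orders_products (order_id, product_id)
SELECT n/3,
       ((1 + (random() * random() * 13409)::int) * power(10,13))::int8::char(20)
FROM generate_series(1, Counter()) AS n;

Multiplying two random() calls makes small product numbers much more common
than large ones.)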

-- 
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Andrus

Richard,


Results are provided in bottom of the message to which you replied.


No - the explains there were contrasting a date test BETWEEN versus =.


I changed the rid.toode statistics target to 100:

ALTER TABLE firma2.rid ALTER COLUMN toode SET STATISTICS 100;
analyze firma2.rid;

Analyze takes 3 seconds and the testcase runs fast.
I'm planning to monitor results by looking in the log file for queries which
take longer than 10 seconds.


Do you still need results ?
If yes, which query and how many times should I run?


Ah, I think I understand. The test case was *missing* this clause.


I added this clause to the testcase. Also added char(70) columns containing
padding characters to all three tables. Still cannot reproduce this issue in
the testcase on a fast devel 8.3 notebook.
In the testcase order_products contains product_id values in a very regular
order; maybe this affects the results. No idea how to use random() to generate
random products for every order.

Andrus. 





Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Andrus

Is it OK to run

ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000

in the prod database, or should I try to decrease 1000 to a smaller value?
rid is a big, growing table and is changed frequently, mostly by adding
rows.


pgAdmin shows that default_statistics_target has its default value of 10 in
the postgresql.conf file.


Andrus. 





Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
> Richard,
> 
>> And the results were?
> 
> Results are provided in bottom of the message to which you replied.

No - the explains there were contrasting a date test BETWEEN versus =.

>> One problem at a time. Let's get the pattern-matching speed problems on
>> your live server sorted, then we can look at different queries.
> 
> First message in this thread described the issue with query having
> additional condition
> 
> AND dok.kuupaev BETWEEN '2008-11-21' AND  '2008-11-21'

Ah, I think I understand. The test case was *missing* this clause.

> It seems that this problem occurs when pattern matching and BETWEEN
> conditions are used in same query.
> 
> According to Scott Garey great recommendation I added
> 
> ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000;
> 
> This fixes testcase in live server, see my other message.
> Is it OK to run
> 
> ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000
> 
> in prod database or should I try to decrease 1000 to smaller value ?
> rid is big increasing table and is changed frequently, mostly by adding
> rows.

This will try to track the 1000 most-common values of "toode", whereas
the default is to try to track the most common 10 values. Tracking more
values means the planner has more accurate information but makes ANALYSE
take longer to run, and also makes planning each query take slightly longer.

Try 100, 200, 500 and see if they work *for a range of queries* -
there's no point in having it much higher than it needs to be.
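
For example, re-using the statement from earlier in the thread and just varying
the target:

ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 200;
ANALYZE rid;
-- then re-run EXPLAIN ANALYSE on the queries you care about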

-- 
  Richard Huxton
  Archonet Ltd



[PERFORM] many to many performance

2008-11-26 Thread Chavdar Kopoev
Hello,

I have the following common situation:

Category IDs: about 50 000
Document IDs: about 3 000 000
Many to many relationship.
A document id has a relation with 10 up to 1000 category ids.
One query: given an input set of document ids, return the set of category ids
related to those input ids (very often this query is called with more than
500k input document ids).
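
In SQL terms the query is essentially this (table and column names below are
only illustrative):

SELECT DISTINCT category_id
FROM document_category                 -- the many-to-many link table
WHERE document_id = ANY (:input_ids);  -- often more than 500k input document ids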

I currently use a custom-written data warehouse: file storage plus in-memory
"id offset" style indexes. A query over all 3 million document ids, returning
almost all category ids, takes less than a second on a desktop machine.
Abstracting from the concrete implementation, the query plan is like:
1. for each input document id: look up an array by document id and retrieve a
list of related category ids.
1.1 for each category id in the list: look up an array value by category id and
mark it as found
2. traverse the category array to extract the category ids marked as found

I want to use postgresql as the data storage. I tried several data structures,
testing btree, gin, and gist indexes over them, but the best performance
achieved, for a 10 times smaller dataset (10k cat ids, 100k doc ids, 1m
relations), is still more than 5 times slower.

I read about postgresql bitmap indexes and "data lookup" when scanning
indexes to get a value for the current transaction. I downloaded the latest
v8.4 snapshot and compiled it, but as far as I can see there is no bitmap index
in it. Maybe if I download the HEAD revision I will find them there; I don't know.

Anyway, I want to ask: has anyone faced a similar situation, and is there any
way to achieve closer-to-optimal performance using postgresql functionality and
extensibility?

Regards,
Chavdar Kopoev




Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Andrus

Richard,


And the results were?


Results are provided in bottom of the message to which you replied.


One problem at a time. Let's get the pattern-matching speed problems on
your live server sorted, then we can look at different queries.


First message in this thread described the issue with query having
additional condition

AND dok.kuupaev BETWEEN '2008-11-21' AND  '2008-11-21'

It seems that this problem occurs when pattern matching and BETWEEN
conditions are used in same query.

According to Scott Carey's great recommendation I added

ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000;

This fixes the testcase on the live server; see my other message.
Is it OK to run

ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000

in the prod database, or should I try to decrease 1000 to a smaller value?
rid is a big, growing table and is changed frequently, mostly by adding
rows.


Andrus. 





Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Andrus

Scott,

My first thought on the query where a pattern is faster than the query
with an exact value is that the planner does not have good enough
statistics on that column.  Without looking at the explain plans further, I
would suggest trying something simple.  The fact that it is faster on 8.3
but slower on 8.1 may have to do with changes between versions, or may
simply be due to luck in the statistics sampling.
See if increasing the statistics target on that column significantly does 
anything:

EXPLAIN (your query);

ALTER TABLE orders_products ALTER COLUMN product_id SET STATISTICS 2000;
ANALYZE orders_products;
EXPLAIN (your query);
2000 is simply a guess of mine for a value much larger than the default. 
This will generally make query planning slower but the system will have a 
lot more data about that column and the distribution of data in it.  This 
should help stabilize the query performance.

If this has an effect, the query plans will change.
Your question below really boils down to something more simple:
--Why is the most optimal query plan not chosen?  This is usually due to
either insufficient statistics or quirks in how the query planner works on
a specific data set or with certain configuration options.


Thank you very much.
I found that  AND dok.kuupaev = date'2008-11-21' runs fast but
AND dok.kuupaev BETWEEN date'2008-11-21' AND date'2008-11-21'  runs very 
slow.


explain SELECT sum(1)
  FROM dok JOIN rid USING (dokumnr)
  JOIN toode USING (toode)
  WHERE rid.toode like '9910%'

plan with default statistics:

"Aggregate  (cost=17.86..17.87 rows=1 width=0)"
"  ->  Nested Loop  (cost=0.00..17.85 rows=1 width=0)"
"->  Nested Loop  (cost=0.00..11.84 rows=1 width=24)"
"  Join Filter: ("outer".dokumnr = "inner".dokumnr)"
"  ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..5.81 
rows=1 width=4)"
"Index Cond: ((kuupaev >= '2008-11-21'::date) AND 
(kuupaev <= '2008-11-21'::date))"
"  ->  Index Scan using rid_toode_pattern_idx on rid 
(cost=0.00..6.01 rows=1 width=28)"
"Index Cond: ((toode ~>=~ '9910'::bpchar) AND (toode 
~<~ '9911'::bpchar))"

"Filter: (toode ~~ '9910%'::text)"
"->  Index Scan using toode_pkey on toode  (cost=0.00..6.00 rows=1 
width=24)"

"  Index Cond: ("outer".toode = toode.toode)"

after the statistics target is changed, the query runs fast (70 ... 1000 ms)

ALTER TABLE rid ALTER COLUMN toode SET STATISTICS 1000;
analyze rid;
explain analyze SELECT sum(1)
  FROM dok JOIN rid USING (dokumnr)
  JOIN toode USING (toode)
  WHERE rid.toode like '9910%'
  AND dok.kuupaev BETWEEN date'2008-11-21' AND date'2008-11-21'
"Aggregate  (cost=27.04..27.05 rows=1 width=0) (actual time=44.830..44.834 
rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.00..27.04 rows=1 width=0) (actual 
time=0.727..44.370 rows=108 loops=1)"
"->  Nested Loop  (cost=0.00..21.02 rows=1 width=24) (actual 
time=0.688..40.519 rows=108 loops=1)"
"  ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..5.81 
rows=1 width=4) (actual time=0.027..8.094 rows=1678 loops=1)"
"Index Cond: ((kuupaev >= '2008-11-21'::date) AND 
(kuupaev <= '2008-11-21'::date))"
"  ->  Index Scan using rid_dokumnr_idx on rid 
(cost=0.00..15.20 rows=1 width=28) (actual time=0.011..0.011 rows=0 
loops=1678)"

"Index Cond: ("outer".dokumnr = rid.dokumnr)"
"Filter: (toode ~~ '9910%'::text)"
"->  Index Scan using toode_pkey on toode  (cost=0.00..6.00 rows=1 
width=24) (actual time=0.016..0.020 rows=1 loops=108)"

"  Index Cond: ("outer".toode = toode.toode)"
"Total runtime: 45.050 ms"

It seems that you are a genius.

I used 1000 since the docs say that the max value is 1000.

The rid table contains 3.5 million rows, will grow by 1 million rows per
year, and is updated frequently, mostly by adding.


Is it OK to leave the

SET STATISTICS 1000;

setting for this column of this table, or should I try to decrease it?

Andrus. 





Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
> Richard and Mario,
> 
>> You can't use xxx_pattern_ops indexes for non-pattern tests.
> 
> I missed regular index. Sorry for that. Now issue with testcase is
> solved. Thank you very much.
> 
> I researched issue in live 8.1.4 db a bit more.
> Performed vacuum and whole db reindex.
> Tried several times to run two same pattern queries in quiet db.

And the results were?

> additonal condition

One problem at a time. Let's get the pattern-matching speed problems on
your live server sorted, then we can look at different queries.

-- 
  Richard Huxton
  Archonet Ltd



Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Andrus

Richard and Mario,


You can't use xxx_pattern_ops indexes for non-pattern tests.


I missed the regular index. Sorry about that. The issue with the testcase is now
solved. Thank you very much.


I researched the issue in the live 8.1.4 db a bit more.
Performed a vacuum and a whole-db reindex.
Tried several times to run the two same pattern queries in a quiet db.

additional condition

AND dok.kuupaev BETWEEN date'2008-11-21' AND date'2008-11-21'

takes 239 seconds to run.

additional condition

AND dok.kuupaev = date'2008-11-21'

takes 1 seconds.

Both query conditions are logically the same.
How can I make the BETWEEN query fast (the real queries run as BETWEEN queries
over some date range)?


P.S. VACUUM issues a warning that free space map 15 is not sufficient,
16 nodes required.
Two days ago after VACUUM FULL there were 6 used entries in the FSM. No idea
why this occurs.


Andrus.

set search_path to firma2,public;
explain analyze SELECT sum(1)
  FROM dok JOIN rid USING (dokumnr)
  JOIN toode USING (toode)
  WHERE rid.toode like '9910%'
  AND dok.kuupaev BETWEEN date'2008-11-21' AND date'2008-11-21'
"Aggregate  (cost=17.86..17.87 rows=1 width=0) (actual 
time=239346.647..239346.651 rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.00..17.85 rows=1 width=0) (actual 
time=3429.715..239345.923 rows=108 loops=1)"
"->  Nested Loop  (cost=0.00..11.84 rows=1 width=24) (actual 
time=3429.666..239339.687 rows=108 loops=1)"

"  Join Filter: ("outer".dokumnr = "inner".dokumnr)"
"  ->  Index Scan using dok_kuupaev_idx on dok  (cost=0.00..5.81 
rows=1 width=4) (actual time=0.028..13.341 rows=1678 loops=1)"
"Index Cond: ((kuupaev >= '2008-11-21'::date) AND 
(kuupaev <= '2008-11-21'::date))"
"  ->  Index Scan using rid_toode_pattern_idx on rid 
(cost=0.00..6.01 rows=1 width=28) (actual time=0.025..86.156 rows=15402 
loops=1678)"
"Index Cond: ((toode ~>=~ '9910'::bpchar) AND (toode 
~<~ '9911'::bpchar))"

"Filter: (toode ~~ '9910%'::text)"
"->  Index Scan using toode_pkey on toode  (cost=0.00..6.00 rows=1 
width=24) (actual time=0.032..0.037 rows=1 loops=108)"

"  Index Cond: ("outer".toode = toode.toode)"
"Total runtime: 239347.132 ms"

explain analyze SELECT sum(1)
  FROM dok JOIN rid USING (dokumnr)
  JOIN toode USING (toode)
  WHERE rid.toode like '9910%'
  AND dok.kuupaev = date'2008-11-21'
"Aggregate  (cost=17.86..17.87 rows=1 width=0) (actual time=707.028..707.032 
rows=1 loops=1)"
"  ->  Nested Loop  (cost=0.00..17.85 rows=1 width=0) (actual 
time=60.890..706.460 rows=108 loops=1)"
"->  Nested Loop  (cost=0.00..11.84 rows=1 width=24) (actual 
time=60.848..701.908 rows=108 loops=1)"
"  ->  Index Scan using rid_toode_pattern_idx on rid 
(cost=0.00..6.01 rows=1 width=28) (actual time=0.120..247.636 rows=15402 
loops=1)"
"Index Cond: ((toode ~>=~ '9910'::bpchar) AND (toode 
~<~ '9911'::bpchar))"

"Filter: (toode ~~ '9910%'::text)"
"  ->  Index Scan using dok_dokumnr_idx on dok  (cost=0.00..5.81 
rows=1 width=4) (actual time=0.021..0.021 rows=0 loops=15402)"

"Index Cond: (dok.dokumnr = "outer".dokumnr)"
"Filter: (kuupaev = '2008-11-21'::date)"
"->  Index Scan using toode_pkey on toode  (cost=0.00..6.00 rows=1 
width=24) (actual time=0.021..0.026 rows=1 loops=108)"

"  Index Cond: ("outer".toode = toode.toode)"
"Total runtime: 707.250 ms"

vmstat 5 output while the slower query was running (the column layout was
mangled in archiving; what is recoverable): swap-in/swap-out (si/so) stayed at
zero, block I/O (bi/bo) was negligible, and one runnable process kept the CPU
busy with essentially no idle or iowait time for the whole run -- i.e. the slow
query looks CPU-bound rather than I/O-bound.

Re: [PERFORM] many to many performance

2008-11-26 Thread Craig Ringer
Chavdar Kopoev wrote:

> I want to use as a data storage postgresql. Tried several data structures, 
> testing btree, gin, gist indecies over them, but best achieved performance 
> for a 10 times smaller dataset (10k cat ids, 100k doc ids, 1m relations) is 
> slower more than 5 times.

Can you post your queries and table definitions so people trying to help
you know what you did / tried to do? A downloadable self contained
example might also be useful.

Please also post the output of `EXPLAIN' on your queries, eg:

EXPLAIN SELECT blah, ... FROM blah;

> I read about postgresql bitmap indecies and "data lookup" when scanning 
> indecies to get a value for current transaction. Downloaded latest v8.4 
> snapshot, compiled it, but as I see there is no bitmap index in it. Maybe if 
> I download HEAD revision I will find them there, dont know.

Bitmap index scans are an internal function that's used to combine two
indexes on the fly during a query (or at least use both of them in one
table scan). You don't make a bitmap index, you just make two ordinary
btree indexes and let Pg take care of this for you.

If you query on the columns of interest a lot, you might want to use a
multi-column index instead.
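
For example (the link table and column names here are just guesses at your
schema), two plain btree indexes:

CREATE INDEX doc_cat_doc_idx ON document_category (document_id);
CREATE INDEX doc_cat_cat_idx ON document_category (category_id);

or, if you mostly look up categories by document, a single multi-column index:

CREATE INDEX doc_cat_doc_cat_idx ON document_category (document_id, category_id);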

--
Craig Ringer





Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Mario Weilguni

Andrus wrote:

Richard,


These are the same but the times are different. I'd be very surprised if
you can reproduce these times reliably.


I re-tried again today and got the same results: in the production database the
pattern query is many times slower than the equality query.

The toode and rid tables contain only a single product starting with 9910,
so both queries should scan exactly the same number of rows.


Can I give you some wider-ranging suggestions Andrus?
1. Fix the vacuuming issue in your hash-join question.


I have run VACUUM FULL VERBOSE ANALYSE and set max_fsm_pages=15,
so that issue was fixed before these tests.

2. Monitor the system to make sure you know if/when disk activity is 
high.


I optimized this system. Now there are short (a few seconds) sales
queries roughly every 5 - 300 seconds, which cause little disk
activity and add a few new rows to some tables.

I haven't seen this activity affect the test results.


3. *Then* start to profile individual queries and look into their plans.
Change the queries one at a time and monitor again.


How can I make the pattern matching query faster?

Andrus.

Btw.

I tried to reproduce this big difference on a test server running 8.3 using
the sample data script below, and got a big difference but in the opposite
direction.


explain analyze   SELECT sum(1)
FROM   orders
JOIN orders_products USING (order_id)
JOIN products USING (product_id)
WHERE orders.order_date>'2006-01-01' and ...

different where clauses produce different results:

AND orders_products.product_id = '3370'  -- 880 .. 926 ms
AND  orders_products.product_id like '3370%' -- 41 ..98 ms

So the pattern index is always 10 .. 20 times (!) faster.
No idea why.

Test data creation script:

begin;
CREATE OR REPLACE FUNCTION Counter() RETURNS int IMMUTABLE AS
$_$
SELECT 350;
$_$ LANGUAGE SQL;

CREATE TEMP TABLE orders (order_id INTEGER NOT NULL, order_date DATE 
NOT NULL);
CREATE TEMP TABLE products (product_id CHAR(20) NOT NULL, product_name 
char(70) NOT NULL, quantity numeric(12,2) default 1);
CREATE TEMP TABLE orders_products (order_id INTEGER NOT NULL, 
product_id CHAR(20),

 id serial, price numeric(12,2) default 1 );

INSERT INTO products SELECT (n*power( 10,13))::INT8::CHAR(20),
  'product number ' || n::TEXT FROM generate_series(0,13410) AS n;

INSERT INTO orders
SELECT n,'2005-01-01'::date + (4000.0 * n/Counter() * '1 DAY'::interval)
FROM generate_series(0, Counter()/3 ) AS n;

SET work_mem TO 2097151;

INSERT INTO orders_products SELECT
  generate_series/3 as  order_id,
  ( (1+ (generate_series % 13410))*power( 10,13))::INT8::CHAR(20) AS 
product_id

FROM generate_series(1, Counter());

ALTER TABLE orders ADD PRIMARY KEY (order_id);
ALTER TABLE products ADD PRIMARY KEY (product_id);
ALTER TABLE orders_products ADD PRIMARY KEY (id);

ALTER TABLE orders_products ADD FOREIGN KEY (product_id) REFERENCES 
products(product_id);
ALTER TABLE orders_products ADD FOREIGN KEY (order_id) REFERENCES 
orders(order_id) ON DELETE CASCADE;


CREATE INDEX orders_date ON orders( order_date );
CREATE INDEX order_product_pattern_idx ON orders_products( product_id 
bpchar_pattern_ops );


COMMIT;
SET work_mem TO DEFAULT;
ANALYZE;


No wonder that = compares badly; you created the index this way:
CREATE INDEX order_product_pattern_idx ON orders_products( product_id 
bpchar_pattern_ops );

why not:
CREATE INDEX order_product_pattern_idx ON orders_products( product_id);

explain analyze   SELECT sum(1)
FROM   orders
JOIN orders_products USING (order_id)
JOIN products USING (product_id)
WHERE orders.order_date>'2006-01-01'
AND orders_products.product_id = '3370';

                                  QUERY PLAN
---------------------------------------------------------------------------
Aggregate  (cost=3013.68..3013.69 rows=1 width=0) (actual 
time=8.206..8.207 rows=1 loops=1)
  ->  Nested Loop  (cost=10.83..3013.21 rows=185 width=0) (actual 
time=2.095..7.962 rows=189 loops=1)
->  Index Scan using products_pkey on products  
(cost=0.00..8.27 rows=1 width=18) (actual time=0.036..0.038 rows=1 loops=1)

  Index Cond: ((product_id)::text = '3370'::text)
->  Nested Loop  (cost=10.83..3003.09 rows=185 width=18) 
(actual time=2.052..7.474 rows=189 loops=1)
  ->  Bitmap Heap Scan on orders_products  
(cost=10.83..949.68 rows=253 width=22) (actual time=0.161..0.817 
rows=261 loops=1)
Recheck Cond: ((product_id)::text = 
'3370'::text)
->  Bitmap Index Scan on foo  (cost=0.00..10.76 
rows=253 width=0) (actual time=0.116..0.116 rows=261 loops=1)
  Index Cond: ((product_id)::text = 
'3370'::text)
  ->  Index Scan using orders_pkey on orders  
(cost=0.00..8.10 rows=1 width=4) (actual time=0.020..0

Re: [PERFORM] Increasing pattern index query speed

2008-11-26 Thread Richard Huxton
Andrus wrote:
> 
> So patter index is 10 .. 20 times (!) faster always.
> No idea why.

Because you don't have a normal index on the product_id column? You
can't use xxx_pattern_ops indexes for non-pattern tests.

> Test data creation script:

The only change to the script was the obvious char(nn) => varchar(nn)
and I didn't use TEMP tables (so I could see what I was doing). Then, I
created the "standard" index on order_products.product_id.

EXPLAIN ANALYSE results from my slow dev box are listed below. Database is in
LATIN9 encoding with locale=C.


 QUERY PLAN

 Aggregate  (cost=2993.69..2993.70 rows=1 width=0) (actual
time=2.960..2.960 rows=1 loops=1)
   ->  Nested Loop  (cost=10.81..2993.23 rows=182 width=0) (actual
time=0.972..2.901 rows=189 loops=1)
 ->  Index Scan using products_pkey on products
(cost=0.00..8.27 rows=1 width=18) (actual time=0.017..0.019 rows=1 loops=1)
   Index Cond: ((product_id)::text = '3370'::text)
 ->  Nested Loop  (cost=10.81..2983.14 rows=182 width=18)
(actual time=0.951..2.785 rows=189 loops=1)
   ->  Bitmap Heap Scan on orders_products
(cost=10.81..942.50 rows=251 width=22) (actual time=0.296..0.771
rows=261 loops=1)
 Recheck Cond: ((product_id)::text =
'3370'::text)
 ->  Bitmap Index Scan on
order_product_pattern_eq_idx  (cost=0.00..10.75 rows=251 width=0)
(actual time=0.230..0.230 rows=261 loops=1)
   Index Cond: ((product_id)::text =
'3370'::text)
   ->  Index Scan using orders_pkey on orders
(cost=0.00..8.12 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=261)
 Index Cond: (orders.order_id =
orders_products.order_id)
 Filter: (orders.order_date > '2006-01-01'::date)
 Total runtime: 3.051 ms
(13 rows)


QUERY PLAN
---
 Aggregate  (cost=25.56..25.57 rows=1 width=0) (actual time=8.244..8.245
rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..25.55 rows=1 width=0) (actual
time=1.170..8.119 rows=378 loops=1)
 ->  Nested Loop  (cost=0.00..17.17 rows=1 width=4) (actual
time=0.043..4.167 rows=522 loops=1)
   ->  Index Scan using order_product_pattern_eq_idx on
orders_products  (cost=0.00..8.88 rows=1 width=22) (actual
time=0.029..1.247 rows=522 loops=1)
 Index Cond: (((product_id)::text >=
'3370'::text) AND ((product_id)::text <
'3371'::text))
 Filter: ((product_id)::text ~~
'3370%'::text)
   ->  Index Scan using products_pkey on products
(cost=0.00..8.27 rows=1 width=18) (actual time=0.004..0.004 rows=1
loops=522)
 Index Cond: ((products.product_id)::text =
(orders_products.product_id)::text)
 ->  Index Scan using orders_pkey on orders  (cost=0.00..8.37
rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=522)
   Index Cond: (orders.order_id = orders_products.order_id)
   Filter: (orders.order_date > '2006-01-01'::date)
 Total runtime: 8.335 ms
(12 rows)


-- 
  Richard Huxton
  Archonet Ltd
