Re: [PERFORM] Retaining execution plans between connections?

2006-01-20 Thread Pandurangan R S
you could use pgpool

http://pgpool.projects.postgresql.org/


On 1/20/06, James Russell [EMAIL PROTECTED] wrote:
 Hi there,

  I am running a website where each page connects to the DB to retrieve and
 write information. Each page load uses a separate connection (rather than
 just sharing one as is the common case) because I use a lot of transactions.

  I am looking to speed up performance, and since each page executes a static
 set of queries where only the parameters change, I was hoping to take
 advantage of stored procedures since I read that PostgreSQL caches the
 execution plans used inside stored procedures.

  However, the documentation states that this execution plan caching is done
 on a per-connection basis. If each page uses a separate connection, I can
 get no performance benefit between pages.

  In other words, there's no benefit to me in putting a one-shot query that
 is basically the same for every page (e.g. SELECT * FROM users WHERE
 user_name='username') inside a stored proc, since the generated execution
 plan will be thrown away once the connection is dropped.

  Has anyone found a way around this limitation? As I said, I can't share the
 DB connection between pages (unless someone knows of a way to do this and
 still retain a level of separation between pages that use the same DB
 connection).

  Many thanks,

  James


---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [PERFORM] Retaining execution plans between connections?

2006-01-20 Thread Neil Conway
On Fri, 2006-01-20 at 18:14 +0900, James Russell wrote:
 I am looking to speed up performance, and since each page executes a
 static set of queries where only the parameters change, I was hoping
 to take advantage of stored procedures since I read that PostgreSQL
 caches the execution plans used inside stored procedures.

Note that you can also take advantage of plan caching by using prepared
statements (PREPARE, EXECUTE and DEALLOCATE). These are also session
local, however (i.e. you can't share prepared statements between
connections).
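A minimal sketch of what this looks like in practice, using the one-shot query from the original post (the statement name and table are illustrative):

```sql
-- Planned once at PREPARE time; the plan lives only in this session.
PREPARE user_by_name (text) AS
    SELECT * FROM users WHERE user_name = $1;

EXECUTE user_by_name('alice');  -- reuses the cached plan
EXECUTE user_by_name('bob');    -- same plan, different parameter

DEALLOCATE user_by_name;  -- optional; the plan is dropped at disconnect anyway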

 As I said, I can't share the DB connection between pages (unless
 someone knows of a way to do this and still retain a level of
 separation between pages that use the same DB connection).

You can't share plans among different sessions at the moment. Can you
elaborate on why you can't use persistent or pooled database
connections?

-Neil





Re: [PERFORM] Extremely irregular query performance

2006-01-20 Thread Bruce Momjian
Simon Riggs wrote:
 On Wed, 2006-01-11 at 22:23 -0500, Tom Lane wrote:
  Jean-Philippe Côté [EMAIL PROTECTED] writes:
   Thanks a lot for this info, I was indeed exceeding the genetic
   optimizer's threshold.  Now that it is turned off, I get
   a very stable response time of 435ms (more or less 5ms) for
   the same query. It is about three times slower than the best
   I got with the genetic optimizer on, but the overall average
   is much lower.
  
  Hmm.  It would be interesting to use EXPLAIN ANALYZE to confirm that the
  plan found this way is the same as the best plan found by GEQO, and
  the extra couple hundred msec is the price you pay for the exhaustive
  plan search.  If GEQO is managing to find a plan better than the regular
  planner then we need to look into why ...
 
 It seems worth noting in the EXPLAIN whether GEQO has been used to find
 the plan, possibly along with other factors influencing the plan such as
 enable_* settings.

I thought the best solution would be to replace QUERY PLAN with GEQO
QUERY PLAN when GEQO was in use.  However, looking at the code, I see
no way to do that cleanly.

Instead, I added documentation to EXPLAIN to highlight the fact the
execution plan will change when GEQO is in use.

(I also removed a documentation mention of the pre-7.3 EXPLAIN output
behavior.)

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073
Index: doc/src/sgml/ref/explain.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/ref/explain.sgml,v
retrieving revision 1.35
diff -c -c -r1.35 explain.sgml
*** doc/src/sgml/ref/explain.sgml	4 Jan 2005 00:39:53 -0000	1.35
--- doc/src/sgml/ref/explain.sgml	20 Jan 2006 16:18:53 -0000
***************
*** 151,161 ****
    </para>
  
    <para>
!    Prior to <productname>PostgreSQL</productname> 7.3, the plan was
!    emitted in the form of a <literal>NOTICE</literal> message.  Now it
!    appears as a query result (formatted like a table with a single
!    text column).
    </para>
   </refsect1>
  
   <refsect1>
--- 151,162 ----
    </para>
  
    <para>
!    Genetic query optimization (<acronym>GEQO</acronym>) randomly
!    tests execution plans.  Therefore, when the number of tables
!    exceeds <varname>geqo</> and genetic query optimization is in use,
!    the execution plan will change each time the statement is executed.
    </para>
+ 
   </refsect1>
  
   <refsect1>



Re: [PERFORM] Extremely irregular query performance

2006-01-20 Thread Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
 <para>
 !    Genetic query optimization (<acronym>GEQO</acronym>) randomly
 !    tests execution plans.  Therefore, when the number of tables
 !    exceeds <varname>geqo</> and genetic query optimization is in use,
 !    the execution plan will change each time the statement is executed.
 </para>

geqo_threshold, please --- geqo is a boolean.

Possibly better wording:  Therefore, when the number of tables exceeds
geqo_threshold causing genetic query optimization to be used, the
execution plan is likely to change each time the statement is executed.

regards, tom lane



Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread Jim C. Nasby
BTW, given all the recent discussion about vacuuming and our MVCC,
http://www.pervasive-postgres.com/lp/newsletters/2006/Insights_Postgres_Jan.asp#3
should prove interesting. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] SELECT MIN, MAX took longer time than SELECT COUNT, MIN, MAX

2006-01-20 Thread Jim C. Nasby
On Fri, Jan 20, 2006 at 12:35:36PM +0800, K C Lau wrote:

Here's the problem... the estimate for the backwards index scan is *way*
off:

  ->  Limit  (cost=0.00..1.26 rows=1 width=4) (actual time=200032.928..200032.931 rows=1 loops=1)
        ->  Index Scan Backward using pk_log on log  (cost=0.00..108047.11 rows=86089 width=4) (actual time=200032.920..200032.920 rows=1 loops=1)
              Filter: (((create_time)::text < '2005/10/19'::text) AND (logsn IS NOT NULL))
  Total runtime: 200051.701 ms

BTW, these queries below are meaningless; they are not equivalent to
min(logsn).

 esdt=> explain analyze select LogSN from Log where create_time < '2005/10/19' order by create_time limit 1;
 
  Limit  (cost=0.00..0.98 rows=1 width=31) (actual time=0.071..0.073 rows=1 loops=1)
    ->  Index Scan using idx_logtime on log  (cost=0.00..84649.94 rows=86089 width=31) (actual time=0.063..0.063 rows=1 loops=1)
          Index Cond: ((create_time)::text < '2005/10/19'::text)
  Total runtime: 0.182 ms
 
 esdt=> explain analyze select LogSN from Log where create_time < '2005/10/19' order by create_time desc limit 1;
 
  Limit  (cost=0.00..0.98 rows=1 width=31) (actual time=0.058..0.061 rows=1 loops=1)
    ->  Index Scan Backward using idx_logtime on log  (cost=0.00..84649.94 rows=86089 width=31) (actual time=0.051..0.051 rows=1 loops=1)
          Index Cond: ((create_time)::text < '2005/10/19'::text)
  Total runtime: 0.186 ms
 
 
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread Joshua D. Drake

Jim C. Nasby wrote:

BTW, given all the recent discussion about vacuuming and our MVCC,
http://www.pervasive-postgres.com/lp/newsletters/2006/Insights_Postgres_Jan.asp#3
should prove interesting. :)
  
Please explain... what is this .asp extension? I have yet to see it 
prove reliable in production ;)





--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/




Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread Jim C. Nasby
On Fri, Jan 20, 2006 at 09:31:14AM -0800, Joshua D. Drake wrote:
 Jim C. Nasby wrote:
 BTW, given all the recent discussion about vacuuming and our MVCC,
 http://www.pervasive-postgres.com/lp/newsletters/2006/Insights_Postgres_Jan.asp#3
 should prove interesting. :)
   
 Please explain... what is the .asp extension. I have yet to see it 
 reliable in production ;)

I lay no claim to our infrastructure. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread Joshua D. Drake



I lay no claim to our infrastructure. :)
  

Can I quote the "Pervasive Senior Engineering Consultant" on that?

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: PLphp, PLperl - http://www.commandprompt.com/




Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread Jim C. Nasby
On Fri, Jan 20, 2006 at 09:37:50AM -0800, Joshua D. Drake wrote:
 
 I lay no claim to our infrastructure. :)
   
 Can I quote the "Pervasive Senior Engineering Consultant" on that?

Sure... I've never been asked to consult on our stuff, and in any case,
I don't do web front-ends (one of the nice things about working with a
team of other consultants). AFAIK IIS will happily talk to PostgreSQL
(though maybe I'm wrong there...)

I *have* asked what database is being used on the backend though, and
depending on the answer to that some folks might have some explaining to
do. :)

*grabs big can of dog food*
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread me

Sure... I've never been asked to consult on our stuff, and in any case,
I don't do web front-ends (one of the nice things about working with a
team of other consultants). AFAIK IIS will happily talk to PostgreSQL
(though maybe I'm wrong there...)


iis (yeah, asp in a successful production environment hehe) & postgresql 
works even better for us than iis & mssql :-)


- thomas 






[PERFORM] query stopped working after tables > 50000 records

2006-01-20 Thread Antoine

Hi,
I have a query that does a left outer join. The query gets some text 
from a reference table where one of the query's main tables may or may 
not have the text table's id. It wasn't super fast, but now it simply 
won't execute. It won't complete either through odbc or via pgadmin 
(haven't yet tried via psql). A week ago (with considerably fewer 
records in the main table) it executed fine, not particularly quickly, 
but not that slowly either. Now it locks up postgres completely (if 
nothing else needs anything it takes 100% cpu), and even after an hour 
gives me nothing. I have come up with a solution that gets the text via 
another query (possibly even a better solution), but this seems very 
strange.
Can anyone shed some light on the subject? I tried a full vacuum on the 
tables that needed it, and a postgres restart, all to no avail.

Cheers
Antoine
ps. I can send the query if that will help...
pps. running a home-compiled 8.1.1 with tables in the query having 7 
records, 3 records and 10 for the outer join. Without the left outer 
join it runs in ~ 1 second.




Re: [PERFORM] [PERFORMANCE] Stored Procedures

2006-01-20 Thread Rikard Pavelic

Hi,
 
Will simple queries such as SELECT * FROM blah_table WHERE tag='x'; work any

faster by putting them into a stored procedure?




IMHO no, why do you think so? You can use PREPARE instead, if you have many
selects like this.



I thought that creating stored procedures in a database meant
storing their execution plans (well, actually storing them like
compiled objects). Well, that's what I learned a couple of years
ago in college ;)

What are the advantages of parsing SP functions every time they are called?

My position is that preparing stored procedures for execution solves
more problems than it creates.
And the most important one is optimizing access to queries from 
multiple connections (which is one of the most important reasons 
for using stored procedures in the first place).


Best regards,
Rikard




Re: [PERFORM] SELECT MIN, MAX took longer time than SELECT COUNT, MIN, MAX

2006-01-20 Thread Tom Lane
Jim C. Nasby [EMAIL PROTECTED] writes:
 On Fri, Jan 20, 2006 at 12:35:36PM +0800, K C Lau wrote:
 Here's the problem... the estimate for the backwards index scan is *way*
 off:

  ->  Limit  (cost=0.00..1.26 rows=1 width=4) (actual time=200032.928..200032.931 rows=1 loops=1)
        ->  Index Scan Backward using pk_log on log  (cost=0.00..108047.11 rows=86089 width=4) (actual time=200032.920..200032.920 rows=1 loops=1)
              Filter: (((create_time)::text < '2005/10/19'::text) AND (logsn IS NOT NULL))
  Total runtime: 200051.701 ms

It's more subtle than you think.  The estimated rowcount is the
estimated number of rows fetched if the indexscan were run to
completion, which it isn't because the LIMIT cuts it off after the
first returned row.  That estimate is not bad (we can see from the
aggregate plan that the true value would have been 106708, assuming
that the logsn IS NOT NULL condition isn't filtering anything).

The real problem is that it's taking quite a long time for the scan
to reach the first row with create_time < '2005/10/19', which is not
too surprising if logsn is strongly correlated with create_time ...
but in the absence of any cross-column statistics the planner has
no very good way to know that.  (Hm ... but both of them probably
also show a strong correlation to physical order ... we could look
at that maybe ...)  The default assumption is that the two columns
aren't correlated and so it should not take long to hit the first such
row, which is why the planner likes the indexscan/limit plan.
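One possible workaround (not suggested in this thread, so treat it as an untested sketch) is to keep the planner from pushing the LIMIT-style plan into a backward scan of pk_log by hiding the qualifying rows behind an optimization barrier such as OFFSET 0:

```sql
-- OFFSET 0 acts as an optimization barrier: the subquery is not
-- flattened, so min() is computed over all qualifying rows rather
-- than relying on the backward-index-scan-plus-LIMIT plan, which
-- performs badly when logsn and create_time are correlated.
SELECT min(logsn)
FROM (SELECT logsn
      FROM log
      WHERE create_time < '2005/10/19'
        AND logsn IS NOT NULL
      OFFSET 0) AS qualifying;
```

This trades the fast-but-fragile plan for a predictable scan over the filtered rows.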

regards, tom lane




Re: [PERFORM] Autovacuum / full vacuum (off-topic?)

2006-01-20 Thread Jim C. Nasby
On Fri, Jan 20, 2006 at 06:46:45PM +0100, [EMAIL PROTECTED] wrote:
 Sure... I've never been asked to consult on our stuff, and in any case,
 I don't do web front-ends (one of the nice things about working with a
 team of other consultants). AFAIK IIS will happily talk to PostgreSQL
 (though maybe I'm wrong there...)
 
 iis (yeah, asp in a successful production environment hehe) & postgresql 
 works even better for us than iis & mssql :-)

Just last night I was talking to someone about different databases and
what-not (he's stuck in a windows shop using MSSQL and I mentioned I'd
heard some bad things about its stability). I realized at some point
that asking about what large installs of something exist is pretty
pointless... given enough effort you can make almost anything scale. As
an example, there's a cable company with a MySQL database that's nearly
1TB... if that's not proof you can make anything scale, I don't know
what is. ;)

What people really need to ask about is how hard it is to make something
work, and how many problems you're likely to keep encountering.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] query stopped working after tables > 50000 records

2006-01-20 Thread Jim C. Nasby
Send query, output of EXPLAIN and table definitions.

On Fri, Jan 20, 2006 at 07:32:34PM +0100, Antoine wrote:
 Hi,
 I have a query that does a left outer join. The query gets some text 
 from a reference table where one of the query's main tables may or may 
 not have the text table's id. It wasn't super fast, but now it simply 
 won't execute. It won't complete either through odbc or via pgadmin 
 (haven't yet tried via psql). A week ago (with considerably fewer 
 records in the main table) it executed fine, not particularly quickly, 
 but not that slowly either. Now it locks up postgres completely (if 
 nothing else needs anything it takes 100% cpu), and even after an hour 
 gives me nothing. I have come up with a solution that gets the text via 
 another query (possibly even a better solution), but this seems very 
 strange.
 Can anyone shed some light on the subject? I tried a full vacuum on the 
 tables that needed it, and a postgres restart, all to no avail.
 Cheers
 Antoine
 ps. I can send the query if that will help...
 pps. running a home-compiled 8.1.1 with tables in the query having 7 
 records, 3 records and 10 for the outer join. Without the left outer 
 join it runs in ~ 1 second.
 
 

-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] [PERFORMANCE] Stored Procedures

2006-01-20 Thread Jim C. Nasby
On Fri, Jan 20, 2006 at 07:50:23PM +0100, Rikard Pavelic wrote:
 Hi,
  
 Will simple queries such as SELECT * FROM blah_table WHERE tag='x'; work 
 any
 faster by putting them into a stored procedure?
 
 
 IMHO no, why do you think so? You can use PREPARE instead, if you have many
 selects like this.
 
 
 I thought that creating stored procedures in a database meant
 storing their execution plans (well, actually storing them like
 compiled objects). Well, that's what I learned a couple of years
 ago in college ;)

My college professor said it, it must be true! ;P

My understanding is that in plpgsql, 'bare' queries get prepared and act
like prepared statements. IE:

SELECT INTO variable
field
FROM table
WHERE condition = true
;
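For instance, a minimal plpgsql sketch (the function and table names are illustrative, not from this thread); the static query inside the body is prepared on first execution in a session and the plan is then reused for the life of that session:

```sql
CREATE OR REPLACE FUNCTION get_user_id(p_name text) RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    -- A 'bare' static query: plpgsql prepares it on first call
    -- and caches the plan for the rest of the session.
    SELECT user_id INTO v_id
    FROM users
    WHERE user_name = p_name;

    RETURN v_id;
END;
$$ LANGUAGE plpgsql;
```

Note this still gives no cross-connection sharing; the cached plan is per-backend, just like an explicit PREPARE.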

 What are the advantages of parsing SP functions every time they are called?
 
 My position is that preparing stored procedures for execution solves
 more problems than it creates.
 And the most important one is optimizing access to queries from 
 multiple connections (which is one of the most important reasons 
 for using stored procedures in the first place).

Ok, so post some numbers then. It might be interesting to look at the
cost of preparing a statement, although AFAIK that does not store the
query plan anywhere.

In most databases, query planning seems to be a pretty expensive
operation. My experience is that that isn't the case with PostgreSQL.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
[ thread moved to pgsql-performance ]

I've obtained a gprof profile on Stephan's sample case (many thanks for
providing the data, Stephan).  The command is
CREATE INDEX foo ON publications_test USING gist (fti_title);
where fti_title is a tsvector column.  There are 236984 rows in the
table, most with between 4 and 10 words in fti_title.
sum(length(fti_title)) yields 1636202 ... not sure if this is a
relevant measure, however.

Using CVS tip with a fairly vanilla configuration (including
--enable-cassert), here are all the hotspots down to the 1% level:

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 20.19  1.90 1.90   588976 0.00 0.00  gistchoose
 19.02  3.69 1.79   683471 0.00 0.00  XLogInsert
  5.95  4.25 0.56  3575135 0.00 0.00  LWLockAcquire
  4.46  4.67 0.42  3579005 0.00 0.00  LWLockRelease
  4.14  5.06 0.39  3146848 0.00 0.00  AllocSetAlloc
  3.72  5.41 0.35   236984 0.00 0.00  gistdoinsert
  3.40  5.73 0.32   876047 0.00 0.00  hash_search
  2.76  5.99 0.26  3998576 0.00 0.00  LockBuffer
  2.28  6.21 0.22 11514275 0.00 0.00  gistdentryinit
  1.86  6.38 0.18   841757 0.00 0.00  UnpinBuffer
  1.81  6.55 0.17 12201023 0.00 0.00  FunctionCall1
  1.81  6.72 0.17   237044 0.00 0.00  AllocSetCheck
  1.49  6.86 0.14   236984 0.00 0.00  gistmakedeal
  1.49  7.00 0.14 10206985 0.00 0.00  FunctionCall3
  1.49  7.14 0.14  1287874 0.00 0.00  MemoryContextAllocZero
  1.28  7.26 0.12   826179 0.00 0.00  PinBuffer
  1.17  7.37 0.11   875785 0.00 0.00  hash_any
  1.17  7.48 0.11  1857292 0.00 0.00  MemoryContextAlloc
  1.17  7.59 0.11   221466 0.00 0.00  PageIndexTupleDelete
  1.06  7.69 0.10  9762101 0.00 0.00  gistpenalty

Clearly, one thing that would be worth doing is suppressing the WAL
traffic when possible, as we already do for btree builds.  It seems
that gistchoose may have some internal ineffiency too --- I haven't
looked at the code yet.  The other thing that jumps out is the very
large numbers of calls to gistdentryinit, FunctionCall1, FunctionCall3.
Some interesting parts of the calls/calledby graph are:

-----------------------------------------------
                0.35    8.07  236984/236984      gistbuildCallback [14]
[15]    89.5    0.35    8.07  236984             gistdoinsert [15]
                0.14    3.55  236984/236984      gistmakedeal [16]
                1.90    0.89  588976/588976      gistchoose [17]
                0.07    0.83  825960/841757      ReadBuffer [19]
                0.09    0.10  825960/1287874     MemoryContextAllocZero [30]
                0.12    0.05 1888904/3998576     LockBuffer [29]
                0.13    0.00  825960/3575135     LWLockAcquire [21]
                0.10    0.00  825960/3579005     LWLockRelease [26]
                0.06    0.00  473968/3146848     AllocSetAlloc [27]
                0.03    0.00  473968/1857292     MemoryContextAlloc [43]
                0.02    0.00  825960/1272423     gistcheckpage [68]
-----------------------------------------------
                0.14    3.55  236984/236984      gistdoinsert [15]
[16]    39.2    0.14    3.55  236984             gistmakedeal [16]
                1.20    0.15  458450/683471      XLogInsert [18]
                0.01    0.66  224997/224997      gistxlogInsertCompletion [20]
                0.09    0.35  444817/444817      gistgetadjusted [23]
                0.08    0.17  456801/456804      formUpdateRdata [32]
                0.17    0.01  827612/841757      UnpinBuffer [35]
                0.11    0.00  221466/221466      PageIndexTupleDelete [42]
                0.02    0.08  456801/460102      gistfillbuffer [45]
                0.06    0.04    1649/1649        gistSplit [46]
                0.08    0.00  685099/3579005     LWLockRelease [26]
                0.03    0.05  446463/446463      gistFindCorrectParent [50]
                0.04    0.02  685099/3998576     LockBuffer [29]
                0.04    0.00    1649/1649        gistextractbuffer [58]
                0.03    0.00  460102/460121      write_buffer [66]
                0.02    0.00  825960/826092      ReleaseBuffer [69]
                0.02    0.00  221402/221402      gistadjscans [82]
                0.00    0.00    1582/1582        gistunion [131]
                0.00    0.00    1649/1649        formSplitRdata [147]
                0.00    0.00    1649/1649        gistjoinvector [178]
                0.00    0.00       3/3           gistnewroot [199]
                0.00    0.00  458450/461748      gistnospace [418]
                0.00    0.00  458450/458450      WriteNoReleaseBuffer [419]
                0.00    0.00    1652/1671    

Re: [PERFORM] [PERFORMANCE] Stored Procedures

2006-01-20 Thread Rikard Pavelic

Jim C. Nasby wrote:

My college professor said it, it must be true! ;P

  

The famous joke ;)

My understanding is that in plpgsql, 'bare' queries get prepared and act
like prepared statements. IE:

SELECT INTO variable
field
FROM table
WHERE condition = true
;

  
Unfortunately I don't know enough about PostgreSQL, but from the
responses I've been reading I've come to that conclusion.

Ok, so post some numbers then. It might be interesting to look at the
cost of preparing a statement, although AFAIK that does not store the
query plan anywhere.

In most databases, query planning seems to be a pretty expensive
operation. My experience is that that isn't the case with PostgreSQL.
  


I wasn't thinking about storing query plans anywhere on disk, but
rather keeping them in a memory pool.
It would be great if we had an option to use prepared statements for
stored procedures, so a procedure would prepare itself the first time
it is called and remain prepared until server shutdown or memory pool
overflow.

This would solve the problem that PREPARE is per-session, so for a
prepared function to be optimal one must use the same connection.



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 02:14:29PM -0500, Tom Lane wrote:
 [ thread moved to pgsql-performance ]
 
 I've obtained a gprof profile on Stephan's sample case (many thanks for
 providing the data, Stephan).  The command is

snip

Something I'm missing is the calls to tsearch functions. I'm not 100%
familiar with gprof, but is it possible those costs have been added
somewhere else because it's in a shared library? Perhaps the costs went
into FunctionCall1/3?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 03:21:45PM -0500, Tom Lane wrote:
 If the totals given by gprof are correct, then it's down in the noise.
 I don't think I trust that too much ... but I don't see anything in the
 gprof manual about how to include a dynamically loaded library in the
 profile.  (I did compile tsearch2 with -pg, but that's evidently not
 enough.)

There is some mention on the web of an environment variable you can
set: LD_PROFILE=libname

These pages seem relevant:
http://sourceware.org/ml/binutils/2002-04/msg00047.html
http://www.scit.wlv.ac.uk/cgi-bin/mansec?1+gprof

It's weird how some man pages for gprof describe this feature, but the
one on my local system doesn't mention it.

 I'll see if I can link tsearch2 statically to resolve this question.

That'll work too...
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Well, I feel like a fool, because I failed to notice that the total
runtime shown in that profile wasn't anywhere close to the actual wall
clock time.  gprof is indeed simply not counting the time spent in
dynamically-linked code.  With tsearch2 statically linked into the
backend, a more believable picture emerges:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ks/call  Ks/call  name
 98.96   1495.93  1495.93 33035195 0.00 0.00  hemdistsign
  0.27   1500.01 4.08 10030581 0.00 0.00  makesign
  0.11   1501.74 1.73   588976 0.00 0.00  gistchoose
  0.10   1503.32 1.58   683471 0.00 0.00  XLogInsert
  0.05   1504.15 0.83   246579 0.00 0.00  sizebitvec
  0.05   1504.93 0.78   446399 0.00 0.00  gtsvector_union
  0.03   1505.45 0.52  3576475 0.00 0.00  LWLockRelease
  0.03   1505.92 0.47 1649 0.00 0.00  gtsvector_picksplit
  0.03   1506.38 0.47  3572572 0.00 0.00  LWLockAcquire
  0.02   1506.74 0.36   444817 0.00 0.00  gtsvector_same
  0.02   1507.09 0.35  4077089 0.00 0.00  AllocSetAlloc
  0.02   1507.37 0.28   236984 0.00 0.00  gistdoinsert
  0.02   1507.63 0.26   874195 0.00 0.00  hash_search
  0.02   1507.89 0.26  9762101 0.00 0.00  gtsvector_penalty
  0.01   1508.08 0.19   236984 0.00 0.00  gistmakedeal
  0.01   1508.27 0.19   841754 0.00 0.00  UnpinBuffer
  0.01   1508.45 0.18 22985469 0.00 0.00  hemdistcache
  0.01   1508.63 0.18  3998572 0.00 0.00  LockBuffer
  0.01   1508.81 0.18   686681 0.00 0.00  gtsvector_compress
  0.01   1508.98 0.17 11514275 0.00 0.00  gistdentryinit

So we gotta fix hemdistsign ...

regards, tom lane



Re: [PERFORM] [PERFORMANCE] Stored Procedures

2006-01-20 Thread Jim C. Nasby
On Fri, Jan 20, 2006 at 08:38:23PM +0100, Rikard Pavelic wrote:
 This would solve problems with prepare which is per session, so for 
 prepared function to be
 optimal one must use same connection.

If you're dealing with something that's performance critical you're not
going to be constantly re-connecting anyway, so I don't see what the
issue is.
-- 
Jim C. Nasby, Sr. Engineering Consultant  [EMAIL PROTECTED]
Pervasive Software  http://pervasive.comwork: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461



Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 04:19:15PM -0500, Tom Lane wrote:
   %   cumulative   self              self     total
  time   seconds   seconds    calls  Ks/call  Ks/call  name
  98.96   1495.93  1495.93 33035195 0.00 0.00  hemdistsign

<snip>

 So we gotta fix hemdistsign ...

lol! Yeah, I guess so. Pretty nasty loop. LOOPBIT will iterate 8*63=504
times and it's going to do silly bit handling on each and every
iteration.

Given that all it's doing is counting bits, a simple fix would be to
loop over bytes, use XOR and count ones. For extreme speedup create a
lookup table with 256 entries to give you the answer straight away...
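Concretely, the byte-wise approach Martijn describes might look like this (an illustrative sketch only, not the actual tsearch2 code; SIGLEN and the function names are placeholders):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGLEN 63               /* 63 bytes = the 8*63 = 504 bits mentioned above */

/* popcount_table[b] = number of one-bits in byte value b */
static uint8_t popcount_table[256];

static void init_popcount_table(void)
{
    for (int b = 0; b < 256; b++)
    {
        int n = 0;

        for (int v = b; v; v >>= 1)
            n += v & 1;
        popcount_table[b] = (uint8_t) n;
    }
}

/* Hamming distance between two signatures: XOR byte-wise, sum table lookups */
static int hemdist_bytes(const uint8_t *a, const uint8_t *b)
{
    int dist = 0;

    for (size_t i = 0; i < SIGLEN; i++)
        dist += popcount_table[a[i] ^ b[i]];
    return dist;
}
```

The inner loop does one XOR and one table fetch per byte instead of eight shift-and-mask steps per bit.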

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 10:37:54PM +0100, Martijn van Oosterhout wrote:
 Given that all it's doing is counting bits, a simple fix would be to
 loop over bytes, use XOR and count ones. For extreme speedup create a
 lookup table with 256 entries to give you the answer straight away...

For extra obfuscation:

  unsigned v = (unsigned)c;
  int num_bits = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

(More more-or-less intelligent options at
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive :-) )

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Given that all it's doing is counting bits, a simple fix would be to
 loop over bytes, use XOR and count ones. For extreme speedup create a
 lookup table with 256 entries to give you the answer straight away...

Yeah, I just finished doing that and got about a 20x overall speedup
(30-some seconds to build the index instead of 10 minutes).  However,
hemdistsign is *still* 70% of the runtime after doing that.  The problem
seems to be that gtsvector_picksplit is calling it an inordinate number
of times:

                0.53   30.02    1649/1649        FunctionCall2 <cycle 2> [19]
[20]    52.4    0.53   30.02    1649             gtsvector_picksplit [20]
               29.74    0.00 23519673/33035195   hemdistsign [18]
                0.14    0.00 22985469/22985469   hemdistcache [50]
                0.12    0.00  268480/10030581    makesign [25]
                0.02    0.00  270400/270400      fillcache [85]
                0.00    0.00    9894/4077032     AllocSetAlloc [34]
                0.00    0.00    9894/2787477     MemoryContextAlloc [69]

(hemdistcache calls hemdistsign --- I think gprof is doing something
funny with tail-calls here, and showing hemdistsign as directly called
from gtsvector_picksplit when control really arrives through hemdistcache.)

The bulk of the problem is clearly in this loop, which performs O(N^2)
comparisons to find the two entries that are furthest apart in hemdist
terms:

for (k = FirstOffsetNumber; k < maxoff; k = OffsetNumberNext(k))
{
    for (j = OffsetNumberNext(k); j <= maxoff; j = OffsetNumberNext(j))
    {
        if (k == FirstOffsetNumber)
            fillcache(&cache[j], GETENTRY(entryvec, j));

        size_waste = hemdistcache(&(cache[j]), &(cache[k]));
        if (size_waste > waste)
        {
            waste = size_waste;
            seed_1 = k;
            seed_2 = j;
        }
    }
}

I wonder if there is a way to improve on that.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


[PERFORM] Sudden slowdown of Pg server

2006-01-20 Thread Jerry Sievers
Hello;  

I am going through a post mortem analysis of an infrequent but
recurring problem on a Pg 8.0.3 installation.  Application code
connects to Pg using J2EE pooled connections.

 PostgreSQL 8.0.3 on sparc-sun-solaris2.9, compiled by GCC 
sparc-sun-solaris2.8-gcc (GCC) 3.3.2

Database is quite large with respect to the number of tables, some of
which have up to 6 million tuples.  Typical idle/busy connection ratio
is 3/100 but occasionally we'll catch 20 or more busy sessions.

The problem manifests itself and appears like a locking issue.  About
weekly, throughput slows down and we notice the busy connection count
rising minute by minute.  2, 20, 40...  Before long, the app server
detects lack of responsiveness and fails over to another app server
(not Pg) which in turn attempts a bunch of new connections into
Postgres.

Sampling of the snapshots of pg_locks and pg_stat_activity tables
takes place each minute.

I am wishing for a few new ideas as to what to be watching; Here's
some observations that I've made.

1. At no time do any UN-granted locks show in pg_locks
2. The number of exclusive locks is small 1, 4, 8
3. Other locks type/mode are numerous but appear like normal workload.
4. There are at least a few old 'IDLE In Transaction' cases in
   activity view
5. No interesting error messages or warning in Pg logs.
6. No crash of Pg backend

Other goodies include a bounty of poorly performing queries which are
constantly being optimized now for good measure.  Aside from the heavy
queries, performance is generally decent.

Resource-related server configs have been boosted substantially but
have not undergone any formal R&D to verify that we're in the safe
zone under heavy load.

A max_fsm_relations setting which is *below* our table and index
count was discovered by me today and will be increased this evening
during a maint cycle.

The slowdown and subsequent run-away app server takes place within a
small 2-5 minute window and I have as of yet not been able to get into
Psql during the event for a hands-on look.

Questions;

1. Is there any type of resource lock that can unconditionally block
   another session and NOT appear as UN-granted lock?

2. What in particular other runtime info would be most useful to
   sample here?

3. What Solaris-side runtime stats might give some clues here
   (maybe?), and how often to sample?  Assume sampling needs to be
   aggressive due to how fast this problem crops up.

Any help appreciated

Thank you


-- 
---
Jerry Sievers   305 854-3001 (home) WWW ECommerce Consultant
305 321-1144 (mobile)   http://www.JerrySievers.com/

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Sudden slowdown of Pg server

2006-01-20 Thread Jignesh K. Shah


lockstat is available in Solaris 9. That can help you to determine if 
there are any kernel-level locks that are occurring during that time.
Solaris 10 also has plockstat which can be used to identify userland 
locks happening in your process.


Since you have Solaris 9, try the following:

You can run (as root)
lockstat sleep 5 
and note the output which can be long.


I guess prstat -am output, iostat -xczn 3, vmstat 3 outputs will 
help also.


prstat -am has a column called LAT; if the value is in double digits, 
then you have a locking issue, which will probably result in a higher SLP 
value for the process.  (Interpretation is data- and workload-specific, 
which this email is too small to decode.)


Once you have identified a particular process (if any) to be the source 
of the problem, get its id and you can look at the outputs of the following 
commands (which are quite intrusive):

truss -c -p $pid 2> truss-syscount.txt

(Ctrl-C after a while to stop collecting)

truss -a -e -u::: -p $pid 2> trussout.txt

(Ctrl-C after a while to stop collecting)

Regards,
Jignesh


Jerry Sievers wrote:

[snip]


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
 I wonder if there is a way to improve on that.

Ooh, the farthest pair problem (in an N-dimensional vector space, though).
I'm pretty sure problems like this have been studied quite extensively in the
literature, although perhaps not with the same norm. It's known under both
"farthest pair" and "diameter", and probably other names. I'm fairly sure it
should be solvable in at least O(n log n).

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Ron

At 05:16 PM 1/20/2006, Steinar H. Gunderson wrote:

On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
 I wonder if there is a way to improve on that.

Ooh, the farthest pair problem (in an N-dimensional vector space, though).
I'm pretty sure problems like this have been studied quite extensively in the
literature, although perhaps not with the same norm. It's known under both
"farthest pair" and "diameter", and probably others. I'm fairly sure it
should be solvable in at least O(n log n).


If the N-dimensional space is Euclidean (any x, x+1 is the same 
distance apart in dimension x), then finding the farthest pair can be 
done in at least O(n).


If you do not want the actual distance and can create the proper data 
structures, particularly if you can update them incrementally as you 
generate pairs, it is often possible to solve this problem in O(lg n) or O(1).


I'll do some grinding.
Ron 




---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Martijn van Oosterhout
On Fri, Jan 20, 2006 at 04:50:17PM -0500, Tom Lane wrote:
 (hemdistcache calls hemdistsign --- I think gprof is doing something
 funny with tail-calls here, and showing hemdistsign as directly called
 from gtsvector_picksplit when control really arrives through hemdistcache.)

It may be the compiler. All these functions are declared static, which
gives the compiler quite a bit of leeway to rearrange code.

 The bulk of the problem is clearly in this loop, which performs O(N^2)
 comparisons to find the two entries that are furthest apart in hemdist
 terms:

Ah. A while ago someone came onto the list asking about indexing bit
strings[1]. If I'd known tsearch worked like this I would have pointed
him to it. Anyway, before he went off to implement it he mentioned
Jarvis-Patrick clustering, whatever that means.

Probably more relevant was this thread[2] on -hackers a while back with
pseudo-code[3]. How well it works, I don't know, it worked for him
evidently, he went away happy...

[1] http://archives.postgresql.org/pgsql-general/2005-11/msg00473.php
[2] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01067.php
[3] http://archives.postgresql.org/pgsql-hackers/2005-11/msg01069.php

Hope this helps,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.




Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Ron [EMAIL PROTECTED] writes:
 For an even more extreme speedup, don't most modern CPUs have an asm 
 instruction that counts the bits (un)set (AKA population counting) 
 in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b 
 CPUs with SWAR instructions)?

Yeah, but fetching from a small constant table is pretty quick too;
I doubt it's worth getting involved in machine-specific assembly code
for this.  I'm much more interested in the idea of improving the
furthest-distance algorithm in gtsvector_picksplit --- if we can do
that, it'll probably drop the distance calculation down to the point
where it's not really worth the trouble to assembly-code it.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 05:29:46PM -0500, Ron wrote:
 If the N-dimensional space is Euclidean (any x, x+1 is the same 
 distance apart in dimension x), then finding the farthest pair can be 
 done in at least O(n).

That sounds a bit optimistic.

  
http://portal.acm.org/ft_gateway.cfm?id=167217&type=pdf&coll=GUIDE&dl=GUIDE&CFID=66230761&CFTOKEN=72453878

is from 1993, but still it struggles with getting it down to O(n log n)
deterministically, for Euclidean 3-space, and our problem is not Euclidean
(although it still satisfies the triangle inequality, which sounds important
to me) and is in a significantly higher dimension...

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 05:50:36PM -0500, Tom Lane wrote:
 Yeah, but fetching from a small constant table is pretty quick too;
 I doubt it's worth getting involved in machine-specific assembly code
 for this.  I'm much more interested in the idea of improving the
 furthest-distance algorithm in gtsvector_picksplit --- if we can do
 that, it'll probably drop the distance calculation down to the point
 where it's not really worth the trouble to assembly-code it.

For the record: Could we do with a less-than-optimal split here? In that
case, an extremely simple heuristic is:

  best = distance(0, 1)
  best_i = 0
  best_j = 1

  for i = 2..last:
      if distance(best_i, i) > best:
          best = distance(best_i, i)
          best_j = i
      else if distance(best_j, i) > best:
          best = distance(best_j, i)
          best_i = i

I've tested it on various data, and although it's definitely not _correct_,
it generally gets within 10%.
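The pseudocode above, fleshed out as a runnable sketch (the integer stand-in distance is purely illustrative; the real code would compare cached signatures by Hamming distance):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in distance function: absolute difference of ints.  The real
 * code would compute the Hamming distance between two signatures. */
static int distance(const int *v, size_t i, size_t j)
{
    int d = v[i] - v[j];

    return d < 0 ? -d : d;
}

/* The linear-time seed-picking heuristic sketched above: keep a
 * candidate pair and greedily swap in any point that widens the
 * separation.  O(n) distance calls instead of O(n^2). */
static void pick_seeds(const int *v, size_t n, size_t *seed_i, size_t *seed_j)
{
    size_t best_i = 0, best_j = 1;
    int best = distance(v, best_i, best_j);

    for (size_t i = 2; i < n; i++)
    {
        if (distance(v, best_i, i) > best)
        {
            best = distance(v, best_i, i);
            best_j = i;
        }
        else if (distance(v, best_j, i) > best)
        {
            best = distance(v, best_j, i);
            best_i = i;
        }
    }
    *seed_i = best_i;
    *seed_j = best_j;
}
```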

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Steinar H. Gunderson [EMAIL PROTECTED] writes:
 For the record: Could we do with a less-than-optimal split here?

Yeah, I was wondering the same.  The code is basically choosing two
seed values to drive the index-page split.  Intuitively it seems that
pretty far apart would be nearly as good as absolute furthest apart
for this purpose.

The cost of a less-than-optimal split would be paid on all subsequent
index accesses, though, so it's not clear how far we can afford to go in
this direction.

It's also worth considering that the entire approach is a heuristic,
really --- getting the furthest-apart pair of seeds doesn't guarantee
an optimal split as far as I can see.  Maybe there's some totally
different way to do it.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
 It's also worth considering that the entire approach is a heuristic,
 really --- getting the furthest-apart pair of seeds doesn't guarantee
 an optimal split as far as I can see.  Maybe there's some totally
 different way to do it.

For those of us who don't know what tsearch2/gist is trying to accomplish
here, could you provide some pointers? :-) During my mini-literature-search
on Google, I've found various algorithms for locating clusters in
high-dimensional metric spaces[1]; some of it might be useful, but I might
just be misunderstanding what the real problem is.

[1] http://ieeexplore.ieee.org/iel5/69/30435/01401892.pdf?arnumber=1401892 ,
for instance

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Tom Lane
Steinar H. Gunderson [EMAIL PROTECTED] writes:
 On Fri, Jan 20, 2006 at 06:52:37PM -0500, Tom Lane wrote:
 It's also worth considering that the entire approach is a heuristic,
 really --- getting the furthest-apart pair of seeds doesn't guarantee
 an optimal split as far as I can see.  Maybe there's some totally
 different way to do it.

 For those of us who don't know what tsearch2/gist is trying to accomplish
 here, could you provide some pointers? :-)

Well, we're trying to split an index page that's gotten full into two
index pages, preferably with approximately equal numbers of items in
each new page (this isn't a hard requirement though).  I think the true
figure of merit for a split is how often will subsequent searches have
to descend into *both* of the resulting pages as opposed to just one
--- the less often that is true, the better.  I'm not very clear on
what tsearch2 is doing with these bitmaps, but it looks like an upper
page's downlink has the union (bitwise OR) of the one-bits in the values
on the lower page, and you have to visit the lower page if this union
has a nonempty intersection with the set you are looking for.  If that's
correct, what you really want is to divide the values so that the unions
of the two sets have minimal overlap ... which seems to me to have
little to do with what the code does at present.
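That figure of merit can be stated concretely: score a candidate split by how many one-bits the two sides' unions share (a search key touching a shared bit must descend into both pages). A rough sketch, assuming signatures are plain byte arrays (SIGLEN and all names here are illustrative, not the actual GiST code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIGLEN 8                /* illustrative signature length in bytes */

static int count_bits(uint8_t b)
{
    int n = 0;

    for (; b; b >>= 1)
        n += b & 1;
    return n;
}

/* OR each side's signatures into a union, then count the one-bits
 * present in BOTH unions; lower overlap means fewer searches that
 * must visit both resulting pages. */
static int split_overlap(const uint8_t sigs[][SIGLEN],
                         const size_t *left, size_t nleft,
                         const size_t *right, size_t nright)
{
    uint8_t lu[SIGLEN] = {0}, ru[SIGLEN] = {0};
    int overlap = 0;

    for (size_t i = 0; i < nleft; i++)
        for (size_t k = 0; k < SIGLEN; k++)
            lu[k] |= sigs[left[i]][k];
    for (size_t i = 0; i < nright; i++)
        for (size_t k = 0; k < SIGLEN; k++)
            ru[k] |= sigs[right[i]][k];
    for (size_t k = 0; k < SIGLEN; k++)
        overlap += count_bits(lu[k] & ru[k]);
    return overlap;
}
```

Minimizing this directly over all partitions is intractable, which is presumably why the code settles for a seed-based heuristic.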

Teodor, Oleg, can you clarify what's needed here?

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 07:23:10PM -0500, Tom Lane wrote:
 I'm not very clear on what tsearch2 is doing with these bitmaps, but it
 looks like an upper page's downlink has the union (bitwise OR) of the
 one-bits in the values on the lower page, and you have to visit the lower
 page if this union has a nonempty intersection with the set you are looking
 for.  If that's correct, what you really want is to divide the values so
 that the unions of the two sets have minimal overlap ... which seems to me
 to have little to do with what the code does at present.

Sort of like the vertex-cover problem? That's probably a lot harder than
finding the two farthest points...

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Craig A. James

Tom Lane wrote:

Well, we're trying to split an index page that's gotten full into two
index pages, preferably with approximately equal numbers of items in
each new page (this isn't a hard requirement though).  ...  If that's
correct, what you really want is to divide the values so that the unions
of the two sets have minimal overlap ... which seems to me to have
little to do with what the code does at present.


This problem has been studied extensively by chemists, and they haven't found 
any easy solutions.

The Jarvis Patrick clustering algorithm might give you hints about a fast approach.  In 
theory it's K*O(N^2), but J-P is preferred for large datasets (millions of molecules) 
because the coefficient K can be made quite low.  It starts with a similarity 
metric for two bit strings, the Tanimoto or Tversky coefficients:

 http://www.daylight.com/dayhtml/doc/theory/theory.finger.html#RTFToC83

J-P Clustering is described here:

 http://www.daylight.com/dayhtml/doc/cluster/cluster.a.html#cl33

J-P Clustering is probably not the best for this problem (see the illustrations 
in the link above to see why), but the general idea of computing 
N-nearest-neighbors, followed by a partitioning step, could be what's needed.

Craig

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] [GENERAL] Creation of tsearch2 index is very slow

2006-01-20 Thread Steinar H. Gunderson
On Fri, Jan 20, 2006 at 05:46:34PM -0500, Ron wrote:
 For an even more extreme speedup, don't most modern CPUs have an asm 
 instruction that counts the bits (un)set (AKA population counting) 
 in various size entities (4b, 8b, 16b, 32b, 64b, and 128b for 64b 
 CPUs with SWAR instructions)?

None in the x86 series that I'm aware of, at least.

You have instructions for finding the highest set bit, though.

/* Steinar */
-- 
Homepage: http://www.sesse.net/

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match