Re: [PERFORM] Index on two columns not used

2006-10-24 Thread Markus Schaber
Hi, Tom,

Tom Lane wrote:

> You're wrong.  An UPDATE always writes a new version of the row (if it
> overwrote the row in-place, it wouldn't be rollback-able).  The new
> version has a different TID and therefore the index entry must change.
> To support MVCC, our approach is to always insert a new index entry
> pointing at the new TID --- the old one remains in place so that the old
> version can still be found by transactions that need it.

OK, good you corrected me.

I had the weird impression that both row versions have the same tuple ID
(as they are different versions of the same tuple), and so an index
change is not necessary when both versions fit on the same page.

Thanks,
Markus

-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Index on two columns not used

2006-10-23 Thread Péter Kovács

Markus,

Thank you for your kind explanation.

Peter

Markus Schaber wrote:

Hi, Peter,

Péter Kovács wrote:
  

Sorry for the amateurish question, but what are "heap tuples"?

Also, my understanding is that the following statement applies only for
composite indexes: "PostgreSQL can't use the values stored in the index
to check the join condition". I assume that PostgreSQL will be able to
use single-column-indexes for join conditions. Is this correct?



Both questions are tightly related:

First, the "heap" is the part of the table where the actual tuples are
stored.

PostgreSQL uses an MVCC system, that means that multiple versions (with
their transaction information) of a single row can coexist in the heap.
This allows for higher concurrency in the backend.

Now, the index basically stores pointers like "pages 23 and 42 contain
rows with value 'foo'", but version information is not replicated to the
index pages, this keeps the index' size requirements low.

Additionally, in most UPDATE cases, the new row version will fit into
the same page as the old version. In this case, the index does not have
to be changed, which is an additional speed improvement.

But when accessing the data via the index, it can only give a
preselection of pages that contain interesting data, and PostgreSQL has
to look into the actual heap pages to check whether there really are row
versions that are visible in the current transaction.


A further problem is that some GIST index types are lossy, that means
the index does not retain the full information, but only an
approximation, for efficiency reasons.

A prominent example are the PostGIS geometry indices, they only store
the bounding box (4 float values) instead of the whole geometry (may be
millions of double precision coordinates). So it may be necessary to
re-check the condition with the real data, using the lossy index for
preselection.

HTH,
Markus
  


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Index on two columns not used

2006-10-23 Thread Tom Lane
Markus Schaber <[EMAIL PROTECTED]> writes:
> Alvaro Herrera wrote:
>> Actually, when the UPDATE puts a new row version in the same heap page,
>> the index must be updated anyway.

> AFAICS only, when the index covers (directly or via function) a column
> that's actually changed.
> Changing columns the index does not depend on should not need any write
> access to that index.
> Correct me if I'm wrong.

You're wrong.  An UPDATE always writes a new version of the row (if it
overwrote the row in-place, it wouldn't be rollback-able).  The new
version has a different TID and therefore the index entry must change.
To support MVCC, our approach is to always insert a new index entry
pointing at the new TID --- the old one remains in place so that the old
version can still be found by transactions that need it.  Once the old
row version is entirely dead, VACUUM is responsible for removing both it
and the index entry pointing at it.

Other DBMSes use other approaches that shift the overhead to other
places, but that's how Postgres does it.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Index on two columns not used

2006-10-23 Thread Markus Schaber
Hi, Alvaro,

Alvaro Herrera wrote:

>> Additionally, in most UPDATE cases, the new row version will fit into
>> the same page as the old version. In this case, the index does not have
>> to be changed, which is an additional speed improvement.
> Actually, when the UPDATE puts a new row version in the same heap page,
> the index must be updated anyway.

AFAICS only, when the index covers (directly or via function) a column
that's actually changed.

Changing columns the index does not depend on should not need any write
access to that index.

Correct me if I'm wrong.

Thanks,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Index on two columns not used

2006-10-23 Thread Alvaro Herrera
Markus Schaber wrote:

> Additionally, in most UPDATE cases, the new row version will fit into
> the same page as the old version. In this case, the index does not have
> to be changed, which is an additional speed improvement.

Actually, when the UPDATE puts a new row version in the same heap page,
the index must be updated anyway.  All the rest of what you said is
correct.

There is another reason not to put visibility info in the index, which
is that it would be extremely complex to update all indexes to contain
the right visibility (and maybe impossible without risking deadlocks).
Updating only the heap is very simple.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Index on two columns not used

2006-10-23 Thread Markus Schaber
Hi, Peter,

Péter Kovács wrote:
> Sorry for the amateurish question, but what are "heap tuples"?
> 
> Also, my understanding is that the following statement applies only for
> composite indexes: "PostgreSQL can't use the values stored in the index
> to check the join condition". I assume that PostgreSQL will be able to
> use single-column-indexes for join conditions. Is this correct?

Both questions are tightly related:

First, the "heap" is the part of the table where the actual tuples are
stored.

PostgreSQL uses an MVCC system, that means that multiple versions (with
their transaction information) of a single row can coexist in the heap.
This allows for higher concurrency in the backend.

Now, the index basically stores pointers like "pages 23 and 42 contain
rows with value 'foo'", but version information is not replicated to the
index pages, this keeps the index' size requirements low.

Additionally, in most UPDATE cases, the new row version will fit into
the same page as the old version. In this case, the index does not have
to be changed, which is an additional speed improvement.

But when accessing the data via the index, it can only give a
preselection of pages that contain interesting data, and PostgreSQL has
to look into the actual heap pages to check whether there really are row
versions that are visible in the current transaction.


A further problem is that some GIST index types are lossy, that means
the index does not retain the full information, but only an
approximation, for efficiency reasons.

A prominent example are the PostGIS geometry indices, they only store
the bounding box (4 float values) instead of the whole geometry (may be
millions of double precision coordinates). So it may be necessary to
re-check the condition with the real data, using the lossy index for
preselection.

HTH,
Markus
-- 
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [PERFORM] Index on two columns not used

2006-10-21 Thread Péter Kovács

Sorry for the amateurish question, but what are "heap tuples"?

Also, my understanding is that the following statement applies only for 
composite indexes: "PostgreSQL can't use the values stored in the index 
to check the join condition". I assume that PostgreSQL will be able to 
use single-column-indexes for join conditions. Is this correct?


Thank you,
Peter

Heikki Linnakangas wrote:

Arnaud Lesauvage wrote:

I have two table with a 2-column index on both of them.
In the first table, the first colum of the index is the primary key, 
the second one is an integer field.

In the second table, the two columns are the primary key.
When I join these two tables, the 2-column index of the first table 
is not used.

Why does the query planner think that this plan  is better ?

ALTER TABLE geo.subcities_names
  ADD CONSTRAINT subcities_names_pkey PRIMARY KEY(subcity_gid, 
language_id);


CREATE INDEX subcities_gid_language_id
  ON geo.subcities
  USING btree
  (gid, official_language_id);

EXPLAIN ANALYZE
SELECT * FROM geo.subcities sc, geo.subcities_names scn
WHERE sc.gid = scn.subcity_gid AND sc.official_language_id = 
scn.language_id;


My theory:

There's no additional restrictions besides the join condition, so the 
system has to scan both tables completely. It chooses to use a full 
index scan instead of a seq scan to be able to do a merge join. 
Because it's going to have to scan the indexes completely anyway, it 
chooses the smallest index which is subcities_pkey.


You'd think that the system could do the merge using just the indexes, 
and only fetch the heap tuples for matches. If that were the case, 
using the 2-column index would indeed be a good idea. However, 
PostgreSQL can't use the values stored in the index to check the join 
condition, so all the heap tuples are fetched anyway. There was just 
recently discussion about this on this list: 
http://archives.postgresql.org/pgsql-performance/2006-09/msg00080.php.




---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Arnaud Lesauvage

Tom Lane a écrit :

Arnaud Lesauvage <[EMAIL PROTECTED]> writes:
When I join these two tables, the 2-column index of the first table is 
not used.

Why does the query planner think that this plan  is better ?


Hm, is gid by itself nearly unique in these tables?  If so, the merge
join would get only marginally more efficient by using both columns as
merge conditions.  Heikki's probably right to guess that the planner
thinks it's better to use the smaller index.

However, if there are lots of duplicate gids, then it ought to have
preferred the two-column merge ...


gid is the primary key of the first table, so absolutely unique.
Thanks for the information Tom !

--
Arnaud

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Tom Lane
Arnaud Lesauvage <[EMAIL PROTECTED]> writes:
> When I join these two tables, the 2-column index of the first table is 
> not used.
> Why does the query planner think that this plan  is better ?

Hm, is gid by itself nearly unique in these tables?  If so, the merge
join would get only marginally more efficient by using both columns as
merge conditions.  Heikki's probably right to guess that the planner
thinks it's better to use the smaller index.

However, if there are lots of duplicate gids, then it ought to have
preferred the two-column merge ...

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Arnaud Lesauvage

Heikki Linnakangas a écrit :

Arnaud Lesauvage wrote:
It is quite typical, yes. It is the base query of a view. In fact, most 
views have a lot more joins (they join with all the upper-level tables).

But 150ms is OK, indeed.


If the query using the view does anything more than a "SELECT * FROM 
view", you should do an explain analyze of the query instead of the 
definition of the view. The access plan might look very different.


The views are used as linked tables in an Access Frontend.
Some accesses are "select * from view", others might filter 
on a country_id or something similar.
For the moment performance is good, so I think I'll keep a 
normalized database as long as it is possible !


Thanks for your help Heikki !

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Heikki Linnakangas

Arnaud Lesauvage wrote:
It is quite typical, yes. It is the base query of a view. In fact, most 
views have a lot more joins (they join with all the upper-level tables).

But 150ms is OK, indeed.


If the query using the view does anything more than a "SELECT * FROM 
view", you should do an explain analyze of the query instead of the 
definition of the view. The access plan might look very different.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Arnaud Lesauvage

Heikki Linnakangas a écrit :

Arnaud Lesauvage wrote:
This query was taken from my "adminsitrative areas" model (continents, 
countries, etc...). Whenever I query this model, I have to join many 
tables.
I don't really know what the overhead of reading the heap-tuples is, but 
would it be a good idea to add data-redundancy in my tables to avoid 
joins ? (adding country_id, continent_id, etc... in the "cities" table)


It depends. I would advise not to denormalize unless you really really 
have to. It's hard to say without more knowledge of the application.


Is the query you showed a typical one? It ran in about 160 ms, is that 
good enough? It's doesn't sound too bad, considering that it returned 
almost 4 rows.



It is quite typical, yes. It is the base query of a view. In 
fact, most views have a lot more joins (they join with all 
the upper-level tables).

But 150ms is OK, indeed.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Heikki Linnakangas

Arnaud Lesauvage wrote:
I did not know that joins were not using index values, and that 
PostgreSQL had to fecth the heap tuples anyway.
Does this mean that this 2-column index is useless ? (I created it for 
the join, I don't often filter on both columns otherwise)


Well, if no-one is using the index, it is useless..

This query was taken from my "adminsitrative areas" model (continents, 
countries, etc...). Whenever I query this model, I have to join many 
tables.
I don't really know what the overhead of reading the heap-tuples is, but 
would it be a good idea to add data-redundancy in my tables to avoid 
joins ? (adding country_id, continent_id, etc... in the "cities" table)


It depends. I would advise not to denormalize unless you really really 
have to. It's hard to say without more knowledge of the application.


Is the query you showed a typical one? It ran in about 160 ms, is that 
good enough? It's doesn't sound too bad, considering that it returned 
almost 4 rows.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Arnaud Lesauvage

Heikki Linnakangas a écrit :

Arnaud Lesauvage wrote:

I have two table with a 2-column index on both of them.
In the first table, the first colum of the index is the primary key, the 
second one is an integer field.

In the second table, the two columns are the primary key.
When I join these two tables, the 2-column index of the first table is 
not used.

Why does the query planner think that this plan  is better ?


You'd think that the system could do the merge using just the indexes, 
and only fetch the heap tuples for matches. If that were the case, using 
the 2-column index would indeed be a good idea. However, PostgreSQL 
can't use the values stored in the index to check the join condition, so 
all the heap tuples are fetched anyway. There was just recently 
discussion about this on this list: 
http://archives.postgresql.org/pgsql-performance/2006-09/msg00080.php.





Thanks for your answer Heikki.
I did not know that joins were not using index values, and 
that PostgreSQL had to fecth the heap tuples anyway.
Does this mean that this 2-column index is useless ? (I 
created it for the join, I don't often filter on both 
columns otherwise)


This query was taken from my "adminsitrative areas" model 
(continents, countries, etc...). Whenever I query this 
model, I have to join many tables.
I don't really know what the overhead of reading the 
heap-tuples is, but would it be a good idea to add 
data-redundancy in my tables to avoid joins ? (adding 
country_id, continent_id, etc... in the "cities" table)


Regards
--
Arnaud


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] Index on two columns not used

2006-10-18 Thread Heikki Linnakangas

Arnaud Lesauvage wrote:

I have two table with a 2-column index on both of them.
In the first table, the first colum of the index is the primary key, the 
second one is an integer field.

In the second table, the two columns are the primary key.
When I join these two tables, the 2-column index of the first table is 
not used.

Why does the query planner think that this plan  is better ?

ALTER TABLE geo.subcities_names
  ADD CONSTRAINT subcities_names_pkey PRIMARY KEY(subcity_gid, 
language_id);


CREATE INDEX subcities_gid_language_id
  ON geo.subcities
  USING btree
  (gid, official_language_id);

EXPLAIN ANALYZE
SELECT * FROM geo.subcities sc, geo.subcities_names scn
WHERE sc.gid = scn.subcity_gid AND sc.official_language_id = 
scn.language_id;


My theory:

There's no additional restrictions besides the join condition, so the 
system has to scan both tables completely. It chooses to use a full 
index scan instead of a seq scan to be able to do a merge join. Because 
it's going to have to scan the indexes completely anyway, it chooses the 
smallest index which is subcities_pkey.


You'd think that the system could do the merge using just the indexes, 
and only fetch the heap tuples for matches. If that were the case, using 
the 2-column index would indeed be a good idea. However, PostgreSQL 
can't use the values stored in the index to check the join condition, so 
all the heap tuples are fetched anyway. There was just recently 
discussion about this on this list: 
http://archives.postgresql.org/pgsql-performance/2006-09/msg00080.php.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


[PERFORM] Index on two columns not used

2006-10-18 Thread Arnaud Lesauvage

Hi list !

I have two table with a 2-column index on both of them.
In the first table, the first colum of the index is the primary key, 
the second one is an integer field.

In the second table, the two columns are the primary key.
When I join these two tables, the 2-column index of the first table is 
not used.

Why does the query planner think that this plan  is better ?

ALTER TABLE geo.subcities_names
  ADD CONSTRAINT subcities_names_pkey PRIMARY KEY(subcity_gid, 
language_id);


CREATE INDEX subcities_gid_language_id
  ON geo.subcities
  USING btree
  (gid, official_language_id);

EXPLAIN ANALYZE
SELECT * FROM geo.subcities sc, geo.subcities_names scn
WHERE sc.gid = scn.subcity_gid AND sc.official_language_id = 
scn.language_id;


Result :

 Merge Join  (cost=0.00..4867.91 rows=37917 width=240) (actual 
time=0.037..149.022 rows=39323 loops=1)

   Merge Cond: ("outer".gid = "inner".subcity_gid)
   Join Filter: ("outer".official_language_id = "inner".language_id)
   ->  Index Scan using subcities_pkey on subcities sc 
(cost=0.00..1893.19 rows=39357 width=200) (actual time=0.015..43.430 
rows=39357 loops=1)
   ->  Index Scan using subcities_names_pkey on subcities_names scn 
(cost=0.00..2269.39 rows=40517 width=40) (actual time=0.012..35.465 
rows=40517 loops=1)

 Total runtime: 157.389 ms
(6 rows)


Thanks for your suggestions !
Regards
--
Arnaud

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly