Re: [SQL] Joining two large tables on a tiny subset of rows

2005-07-27 Thread Richard Huxton

Dmitri Bichko wrote:

Hello,

I have two tables, one has a foreing key from the other (only showing
the relevant columns and indices here):

 Table "expresso.probes"
   Column|  Type  | Modifiers
-++---
 platform_id | integer| not null
 probe_num   | integer| not null
 mrna_acc| character varying(50)  |
Indexes:
"idx_probes_mrna_acc" btree (mrna_acc, platform_id) WHERE mrna_acc
IS NOT NULL

   Table "expresso.mrna_info"
   Column| Type  | Modifiers
-+---+---
 mrna_acc| character varying(25) | not null
 symbol  | character varying(50) | not null
Indexes:
"idx_mrna_info_symbol" btree (upper(symbol::text)) WHERE symbol IS
NOT NULL


1. The two types of mrna_acc don't match - one has a max length of 25, 
one 50. Why?
2. With idx_probes_mrna_acc, why WHERE mrna_acc IS NOT NULL? NULLs 
aren't indexed anyway.
3. You say there is a foreign key, but I don't even see a primary key 
anywhere. I'm guessing mrna_info.mrna_acc is the primary key for that table.



Here are the explains for the two step process:

expression=> EXPLAIN ANALYZE SELECT mrna_acc FROM mrna_info WHERE
upper(symbol) = upper('pparg') AND symbol IS NOT NULL;


As someone else mentions, the IS NOT NULL is redundant.


 Index Scan using idx_mrna_info_symbol on mrna_info  (cost=0.00..2934.78
rows=930 width=12) (actual time=0.038..0.089 rows=12 loops=1)


Note that the estimated number of rows is wrong though (930 rather than 
the actual 12).



EXPLAIN ANALYZE SELECT platform_id, probe_num FROM probes WHERE mrna_acc
IN
('U10374','U09138','U01841','U01664','NM_015869','NM_013124','NM_011146'
,'NM_005037','D83233','BC021798','BC006811','AB011365') AND mrna_acc IS
NOT NULL;


Again, a redundant IS NOT NULL, which presumably you're putting in to 
use the index.



 Index Scan using idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc, idx_probes_mrna_acc, idx_probes_mrna_acc,
idx_probes_mrna_acc on probes  (cost=0.00..14710.63 rows=4151 width=8)
(actual time=0.040..0.719 rows=142 loops=1)


Again, it's getting the row estimate badly wrong (4151 vs 142).


And here is the explain of the join (it's essentially the same plan as
the subselect and all the other ways I've tried):

expression=> explain SELECT platform_id, probe_num FROM mrna_info m,
probes p WHERE m.mrna_acc = p.mrna_acc and p.mrna_acc is not null and
UPPER(symbol) = UPPER('pparg') AND m.symbol IS NOT NULL;


Well, firstly get rid of the redundant "IS NOT NULL"s in the query and 
the indexes, then vacuum analyse the tables and post an EXPLAIN ANALYSE.


The problem will probably turn out to be poor row estimates (you can 
increase the statistics gathered on the mrna_acc values) or poor 
configuration settings (making indexes look expensive compared to 
sequential scans).


--
  Richard Huxton
  Archonet Ltd

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [SQL] REINDEX DATABASE

2005-07-27 Thread Chris Browne
[EMAIL PROTECTED] (Achilleus Mantzios) writes:
> O Chris Browne έγραψε στις Jul 26, 2005 :
>> You can get a savings of about 4% of the space, but at the cost of
>> taking an appreciable outage during which the database is not usable.
>
> 1st not all database is unusable, during the whole reindexdb run,
> 2nd outage outside office hours is acceptable for those who apply.

I'm not willing to assume that.

I help support some applications where "outages outside office hours"
are acceptable; I help support some other applications for which
"office hours" are 24 hours per day, 7 days per week, and such an
outage would be deemed decidedly unacceptable.

>> I wouldn't expect the 4% savings in space to lead to a particularly
>> measurable improvement in performance, certainly not one worth the
>> outage.

> I am just saying that the common saying "reindex is not needed for
> 7.4+" maybe is not true in all circumstances.

Let me quote the release notes for 7.4.1:

"Make free space map efficiently reuse empty index pages, and other
free space management improvements

In previous releases, B-tree index pages that were left empty
because of deleted rows could only be reused by rows with index
values similar to the rows originally indexed on that page. In
7.4, VACUUM records empty index pages and allows them to be reused
for any future index rows."



In versions earlier than 7.4, running a REINDEX periodically was
*essential* if you had update patterns consistent with the (remarkably
common) scenario described above.

This reason to reindex (which was the main reason we required
reindexing when using 7.2) has been resolved and gone away in 7.4.

There may be other factors that could mandate REINDEX; as far as I can
tell, the main such factor that remains would be where a table sees
enormous numbers of updates but is not VACUUMed often enough.

_That_ scenario isn't consistent with what you describe, as it would
be expected to involve a whole lot more than 4% growth in the size of
the database.
-- 
(format nil "[EMAIL PROTECTED]" "cbbrowne" "acm.org")
http://www.ntlug.org/~cbbrowne/sap.html
Rules of the Evil Overlord #78.  "I will not tell my Legions of Terror
"And he must  be taken alive!" The command will be:  ``And try to take
him alive if it is reasonably practical.''"


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


[SQL] Foreign key with check?

2005-07-27 Thread Jeff Boes
Given a table like this:

create table primary (
  a integer primary key,
  b boolean
);

And another like this:

create table secondary (
  a integer,
  some_other_fields
);


I would like a foreign key constraint on the "secondary" table that
looks something like:

foreign key (a, true) references primary (a, b)

That is, a row in "secondary" is allowed to reference a row in "primary"
if and only if that referenced row has (b = true).

I think I have to do this with an additional trigger. Any other options?


---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [SQL] Foreign key with check?

2005-07-27 Thread Bruno Wolff III
On Wed, Jul 27, 2005 at 16:08:19 -0400,
  Jeff Boes <[EMAIL PROTECTED]> wrote:
> Given a table like this:
> 
> create table primary (
>   a integer primary key,
>   b boolean
> );
> 
> And another like this:
> 
> create table secondary (
>   a integer,
>   some_other_fields
> );
> 
> 
> I would like a foreign key constraint on the "secondary" table that
> looks something like:
> 
> foreign key (a, true) references primary (a, b)
> 
> That is, a row in "secondary" is allowed to reference a row in "primary"
> if and only if that referenced row has (b = true).

If what you are saying is that all entries in secondary need to reference
a row in primary and in addition the referenced row must have b true, then
you can do this fairly simply using a bit of extra space.
Add a row b to secondary with a constraint that b = true. Then make the
foreign key reference on both a and b.

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [SQL] REINDEX DATABASE

2005-07-27 Thread Tom Lane
Chris Browne <[EMAIL PROTECTED]> writes:
> [EMAIL PROTECTED] (Achilleus Mantzios) writes:
>> I am just saying that the common saying "reindex is not needed for
>> 7.4+" maybe is not true in all circumstances.

> ...

> In versions earlier than 7.4, running a REINDEX periodically was
> *essential* if you had update patterns consistent with the (remarkably
> common) scenario described above.

> This reason to reindex (which was the main reason we required
> reindexing when using 7.2) has been resolved and gone away in 7.4.

> There may be other factors that could mandate REINDEX; as far as I can
> tell, the main such factor that remains would be where a table sees
> enormous numbers of updates but is not VACUUMed often enough.

I think the case that 7.4 resolved is where you have a
continually-moving window of index values; for example, an index on a
timestamp column in a table where you delete entries older than 30 days.
Before 7.4, index pages for timestamps older than 30 days would become
empty and then just sit there, with no other way to reclaim them than
REINDEX.

The case that isn't resolved yet is where you have a usage pattern that
causes a lot of index pages to become mostly but not entirely empty.
For example, your entries are timestamps, and you have a cleanup process
that removes just 99 out of every 100 successive entries.  This'll leave
you with just a couple of index entries per page, which might not be
infinite bloat but it's surely not too efficient.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org