Re: [PERFORM] make bulk deletes faster?

2005-12-19 Thread Mitch Skinner
On Sat, 2005-12-17 at 21:10 -0800, James Klo wrote:
> I need to routinely move data from the timeblock table to an archive
> table with the same schema named timeblock_archive.  I really need this
> to happen as quickly as possible, as the archive operation appears to
> really tax the db server...

Have you considered partitioning?

http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html

If you can partition your timeblock table so that you archive an entire
partition at a time, then you can delete the archived rows by just
dropping (or truncating) that partition.  AFAIK there's no way to
re-parent a partition (e.g., from the timeblock table to the
timeblock_archive table).

If your app is particularly cooperative you might be able to use
partitioning to avoid moving data around entirely.  If table accesses
are always qualified by something you can use as a partitioning key,
then partitioning can give you the speed benefits of a small table
without the effort of keeping it cleared out.
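A minimal sketch of what that might look like on 8.1 (assuming a
timestamp column, here called block_start, that the app can qualify on;
all names here are illustrative, not from the original schema):

```sql
-- One child table per month, inheriting from the parent:
CREATE TABLE timeblock_2005_12 (
    CHECK (block_start >= DATE '2005-12-01'
       AND block_start <  DATE '2006-01-01')
) INHERITS (timeblock);

-- With constraint_exclusion = on, queries qualified by block_start
-- only touch the relevant children.  Archiving a month then avoids
-- the big DELETE entirely:
INSERT INTO timeblock_archive SELECT * FROM ONLY timeblock_2005_12;
DROP TABLE timeblock_2005_12;   -- or TRUNCATE, to keep the partition
```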

Another good read, if you haven't yet, is
http://powerpostgresql.com/Downloads/annotated_conf_80.html
especially the Memory, Checkpoints, and maybe WAL options
sections.  If you're doing large deletes then you may need to increase
your free space map settings--if a VACUUM VERBOSE finishes by saying
that you need more FSM pages, then the table may have gotten bloated
over time (which can be fixed with a configuration change and a VACUUM
FULL, though this will lock everything else out of the table while it's
running).
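Concretely, the knobs involved look like this (the numbers are
placeholders, not recommendations; size them from the totals that
VACUUM VERBOSE reports at the end of its output):

```sql
-- postgresql.conf fragment (a server restart is needed to change these):
--   max_fsm_pages     = 200000    -- pages with free space to remember
--   max_fsm_relations = 1000      -- relations to track in the FSM
--
-- After raising them, reclaim the existing bloat; as noted above, this
-- locks everything else out of the table while it runs:
VACUUM FULL VERBOSE timeblock;
```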

Mitch


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Overriding the optimizer

2005-12-16 Thread Mitch Skinner
On Thu, 2005-12-15 at 18:23 -0800, Craig A. James wrote:
> "So, you still have no problem" is exactly wrong, because Postgres picked the
> wrong plan.  Postgres decided that applying myfunc() to 10,000,000 rows was a
> better plan than an index scan of 50,000 row_nums.  So I'm screwed.

FWIW,
The cost_functionscan procedure in costsize.c has the following comment:
/*
 * For now, estimate function's cost at one operator eval per function
 * call.  Someday we should revive the function cost estimate columns
 * in pg_proc...
 */

I recognize that you're trying to talk about the issue in general rather
than about this particular example.  However, the example does seem to
me to be exactly the case where the effort might be better spent
improving the optimizer (reviving the function cost estimate columns),
rather than implementing a general hinting facility.  Which one is more
effort?  I don't really know for sure, but cost_functionscan does seem
pretty straightforward.

What percentage of problems raised on this list can be fixed by setting
configuration parameters, adding indexes, increasing statistics, or
re-architecting a crazy schema?  I've only been lurking for a few
months, but it seems like a pretty large fraction.  Of the remainder,
what percentage represent actual useful feedback about what needs
improvement in the optimizer?  A pretty large fraction, I think.
Including your example.

Personally, I think whoever was arguing for selectivity hints in
-hackers recently made a pretty good point, so I'm partly on your side.
Actually, function cost hints don't really seem that much different
from selectivity hints, and both seem to me to be slicker solutions
(closer to the right level of abstraction) than a general hint facility.

Mitch




Re: [PERFORM] Simple Join

2005-12-15 Thread Mitch Skinner
On Thu, 2005-12-15 at 01:48 -0600, Kevin Brown wrote:
> > Well, I'm no expert either, but if there was an index on
> > ordered_products (paid, suspended_sub, id) it should be mergejoinable
> > with the index on to_ship.ordered_product_id, right?  Given the
> > conditions on paid and suspended_sub.
> 
> The following is already there:
> 
> CREATE INDEX ordered_product_id_index
>   ON to_ship
>   USING btree
>   (ordered_product_id);
> 
> That's why I emailed this list.

I saw that; what I'm suggesting is that you try creating a 3-column
index on ordered_products using the paid, suspended_sub, and id columns.
In that order, I think, although you could also try the reverse.  It may
or may not help, but it's worth a shot--the fact that all of those
columns are used together in the query suggests that you might do better
with a three-column index on those.

With all three columns indexed individually, you're apparently not
getting the bitmap plan that Mark is hoping for.  I imagine this has to
do with the lack of multi-column statistics in postgres, though you
could also try raising the statistics target on the columns of interest.

Setting enable_seqscan to off, as others have suggested, is also a
worthwhile experiment, just to see what you get.
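Spelled out in SQL, the three experiments would be something like this
(table and column names are from the thread; the index name and the
statistics target of 100 are just example values):

```sql
-- 1. A three-column index matching the conditions plus the join key:
CREATE INDEX ordered_products_paid_susp_id
    ON ordered_products (paid, suspended_sub, id);

-- 2. Finer-grained statistics on the filtered columns:
ALTER TABLE ordered_products ALTER COLUMN paid SET STATISTICS 100;
ALTER TABLE ordered_products ALTER COLUMN suspended_sub SET STATISTICS 100;
ANALYZE ordered_products;

-- 3. A session-local experiment to see the alternative plan:
SET enable_seqscan = off;
-- EXPLAIN ANALYZE <the original join query>;
RESET enable_seqscan;
```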

Mitch




Re: [PERFORM] Database restore speed

2005-12-03 Thread Mitch Skinner
On Fri, 2005-12-02 at 23:03 -0500, Luke Lonergan wrote:
> And how do we compose the binary data on the client?  Do we trust that the
> client encoding conversion logic is identical to the backend's?

Well, my newbieness is undoubtedly showing already, so I might as well
continue with my line of dumb questions.  I did a little mail archive
searching, but had a hard time coming up with unique query terms.

This is a slight digression, but my question about binary format query
results wasn't rhetorical.  Do I have to worry about different platforms
when I'm getting binary RowData(s) back from the server?  Or when I'm
sending binary bind messages?

Regarding whether or not the client has identical encoding/conversion
logic, how about a fast path that starts out by checking for
compatibility?  In addition to a BOM, you could add a "float format
mark": an array of values like +0.0, -0.0, min, max, +Inf, -Inf,
NaN, etc.

It looks like XDR specifies byte order for floats and otherwise punts to
IEEE.  I have no experience with SQL*Loader, but a quick read of the
docs appears to divide data types into "portable" and "nonportable"
groups, where loading nonportable data types requires extra care.

This may be overkill, but have you looked at HDF5?  Only one hit came up
in the mail archives.
http://hdf.ncsa.uiuc.edu/HDF5/doc/H5.format.html
For (e.g.) floats, the format includes metadata that specifies byte
order, padding, normalization, the location of the sign, exponent, and
mantissa, and the size of the exponent and mantissa.  The format appears
not to require length information on a per-datum basis.  A cursory look
at the data format page gives me the impression that there's a useful
streamable subset.  The license of the implementation is BSD-style (no
advertising clause), and it appears to support a large variety of
platforms.  Currently, the format spec only mentions ASCII, but since
the library doesn't do any actual string manipulation (just storage and
retrieval, AFAICS) it may be UTF-8 clean.

Mitch



Re: [PERFORM] Database restore speed

2005-12-02 Thread Mitch Skinner
On Fri, 2005-12-02 at 13:24 -0800, Luke Lonergan wrote:
> It's a matter of safety and generality - in general you
> can't be sure that client machines / OS'es will render the same conversions
> that the backend does in all cases IMO.

Can't binary values be sent safely cross-platform in DataRow
messages?  At least from my ignorant, cursory look at printtup.c,
there's a binary format code path.  float4send in utils/adt/float.c uses
pq_sendfloat4.  I obviously haven't followed the entire rabbit trail,
but it seems like it happens.

IOW, why isn't there a cross-platform issue when sending binary data
from the backend to the client in query results?  And if there isn't a
problem there, why can't binary data be sent from the client to the
backend?

Mitch



Re: [PERFORM] same plan, add 1 condition, 1900x slower

2005-11-11 Thread Mitch Skinner
On Thu, 2005-11-10 at 12:23 -0500, Tom Lane wrote:
> Apparently, you are using a platform and/or locale in which strcoll() is
> spectacularly, god-awfully slow --- on the order of 10 msec per comparison.

The version with the condition is definitely doing more I/O.  The
version without the condition doesn't read at all.  I strace'd an
explain analyze for each separately, and this is what I ended up with
(the first is with the condition, the second is without):

bash-2.05b$ cut '-d(' -f1 subsourcestrace | sort | uniq -c
   7127 gettimeofday
  75213 _llseek
  1 Process 30227 attached - interrupt to quit
  1 Process 30227 detached
 148671 read
  2 recv
  4 semop
  4 send
bash-2.05b$ cut '-d(' -f1 subsourcestrace-nocond | sort | uniq -c
   9103 gettimeofday
  7 _llseek
  1 Process 30227 attached - interrupt to quit
  1 Process 30227 detached
  2 recv
  4 send

For the moment, all of the rows in the view I'm selecting from satisfy
the condition, so the output of both queries is the same.  The relevant
rows of the underlying tables are probably pretty contiguous (all of the
rows satisfying the condition and the join were inserted at the same
time).  Could it just be the result of a weird physical distribution of
data in the table/index files?  For the fast query, the actual number of
rows is a lot less than the planner expects.
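One way to check the physical-distribution guess (a diagnostic idea on
my part, not something from the thread) is the correlation column of
the pg_stats view:

```sql
-- correlation near +/-1.0 means an index scan walks the heap mostly
-- sequentially; near 0 means it jumps around, which is much slower.
SELECT tablename, attname, correlation
  FROM pg_stats
 WHERE tablename IN ('subject', 'external_id_map');
```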

> This is a bit hard to believe but I can't make sense of those numbers
> any other way.  What is the platform exactly, and what database locale
> and encoding are you using?

It's RHEL 3 on x86:
[EMAIL PROTECTED] root]# uname -a
Linux rehoboam 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005
i686 i686 i386 GNU/Linux

The glibc version is 2.3.2.

statgen=# select current_setting('lc_collate');
 current_setting
-----------------
 en_US.UTF-8
(1 row)

Not sure what's relevant, but here's some more info:
The machine has 4.5GiB of RAM and a 5-disk Raid 5.  It's a dual xeon
3.2ghz.

           relname           | relpages |  reltuples
-----------------------------+----------+-------------
 external_id_map             |   126883 | 1.55625e+07
 external_id_map_primary_key |    64607 | 1.55625e+07
 subject                     |       31 |        1186
 subject_pkey                |       19 |        1186
(4 rows)

I've attached the output of select name, setting from pg_settings.

And, in case my original message isn't handy, the explain analyze output
and table/view info is below.

Thanks for taking a look,
Mitch

statgen=# explain analyze select * from subject_source;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=0.00..330.72 rows=1186 width=46) (actual
time=0.051..8.890 rows=1186 loops=1)
   Merge Cond: ("outer".id = "inner".target_id)
   ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36
rows=1186 width=28) (actual time=0.022..1.441 rows=1186 loops=1)
   ->  Index Scan using external_id_map_primary_key on external_id_map
eim  (cost=0.00..2485226.70 rows=15562513 width=26) (actual
time=0.016..2.532 rows=2175 loops=1)
 Total runtime: 9.592 ms
(5 rows)

statgen=# explain analyze select * from subject_source where
source='SCH';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=0.00..1147.33 rows=1 width=46) (actual
time=0.054..20258.161 rows=1186 loops=1)
   Merge Cond: ("outer".id = "inner".target_id)
   ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36
rows=1186 width=28) (actual time=0.022..1.478 rows=1186 loops=1)
   ->  Index Scan using external_id_map_primary_key on external_id_map
eim  (cost=0.00..2524132.99 rows=2335 width=26) (actual
time=0.022..20252.326 rows=1186 loops=1)
         Filter: (source = 'SCH'::bpchar)
 Total runtime: 20258.922 ms
(6 rows)

statgen=# \d subject_source
          View "public.subject_source"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 id        | bigint                |
 sex       | integer               |
 parent1   | bigint                |
 parent2   | bigint                |
 source    | character(3)          |
 source_id | character varying(32) |
View definition:
 SELECT norm.id, norm.sex, norm.parent1, norm.parent2, eim.source,
eim.source_id
   FROM subject norm
   JOIN util.external_id_map eim ON norm.id = eim.target_id;

statgen=# \d subject
     Table "public.subject"
 Column  |  Type   | Modifiers
---------+---------+-----------
 id      | bigint  | not null
 sex     | integer |
 parent1 | bigint  |
 parent2 | bigint  |
Indexes:
    "subject_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "subject_parent1" FOREIGN KEY (parent1) REFERENCES subject(id)
DEFERRABLE INITIALLY DEFERRED
    "subject_parent2" FOREIGN KEY (parent2) REFERENCES subject(id)
DEFERRABLE INITIALLY DEFERRED
    "subject_id_map" FOREIGN KEY (id) REFERENCES
util.external_id_map(target_id) DEFERRABLE INITIALLY DEFERRED
[PERFORM] same plan, add 1 condition, 1900x slower

2005-11-10 Thread Mitch Skinner
This is with Postgres 8.0.3.  Any advice is appreciated.  I'm not sure
exactly what I expect, but I was hoping that if it used the
external_id_map_source_target_id index it would be faster.  Mainly I was
surprised that the same plan could perform so much differently with just
an extra condition.

I've increased the statistics target on util.external_id_map.source, but
I'm fuzzy on exactly where (what columns) the planner could use more
information.
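For reference, the statistics change was along these lines (the exact
target value shown here is arbitrary):

```sql
ALTER TABLE util.external_id_map ALTER COLUMN source SET STATISTICS 300;
ANALYZE util.external_id_map;
```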

statgen=# explain analyze select * from subject_source;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=0.00..316.79 rows=1186 width=46) (actual
time=0.136..9.808 rows=1186 loops=1)
   Merge Cond: ("outer".id = "inner".target_id)
   ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36
rows=1186 width=28) (actual time=0.050..1.834 rows=1186 loops=1)
   ->  Index Scan using external_id_map_primary_key on external_id_map
eim  (cost=0.00..2345747.01 rows=15560708 width=26) (actual
time=0.061..2.944 rows=2175 loops=1)
 Total runtime: 10.745 ms
(5 rows)

statgen=# explain analyze select * from subject_source where
source='SCH';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=0.00..640.95 rows=1 width=46) (actual
time=0.043..21074.403 rows=1186 loops=1)
   Merge Cond: ("outer".id = "inner".target_id)
   ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36
rows=1186 width=28) (actual time=0.014..1.478 rows=1186 loops=1)
   ->  Index Scan using external_id_map_primary_key on external_id_map
eim  (cost=0.00..2384648.78 rows=4150 width=26) (actual
time=0.020..21068.508 rows=1186 loops=1)
         Filter: (source = 'SCH'::bpchar)
 Total runtime: 21075.142 ms
(6 rows)

statgen=# \d subject
     Table "public.subject"
 Column  |  Type   | Modifiers
---------+---------+-----------
 id      | bigint  | not null
 sex     | integer |
 parent1 | bigint  |
 parent2 | bigint  |
Indexes:
    "subject_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "subject_parent1" FOREIGN KEY (parent1) REFERENCES subject(id)
DEFERRABLE INITIALLY DEFERRED
    "subject_parent2" FOREIGN KEY (parent2) REFERENCES subject(id)
DEFERRABLE INITIALLY DEFERRED
    "subject_id_map" FOREIGN KEY (id) REFERENCES
util.external_id_map(target_id) DEFERRABLE INITIALLY DEFERRED

statgen=# \d subject_source
          View "public.subject_source"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 id        | bigint                |
 sex       | integer               |
 parent1   | bigint                |
 parent2   | bigint                |
 source    | character(3)          |
 source_id | character varying(32) |
View definition:
 SELECT norm.id, norm.sex, norm.parent1, norm.parent2, eim.source,
eim.source_id
   FROM subject norm
   JOIN util.external_id_map eim ON norm.id = eim.target_id;

statgen=# \d util.external_id_map
      Table "util.external_id_map"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 source_id | character varying(32) | not null
 source    | character(3)          | not null
 target_id | bigint                | not null
Indexes:
    "external_id_map_primary_key" PRIMARY KEY, btree (target_id)
    "external_id_map_source_source_id_unique" UNIQUE, btree (source,
source_id)
    "external_id_map_source" btree (source)
    "external_id_map_source_target_id" btree (source, target_id)
Foreign-key constraints:
    "external_id_map_source" FOREIGN KEY (source) REFERENCES
util.source(id)

Thanks in advance,
Mitch
