Re: [PERFORM] make bulk deletes faster?
On Sat, 2005-12-17 at 21:10 -0800, James Klo wrote:
> I need to routinely move data from the timeblock table to an archive
> table with the same schema named timeblock_archive. I really need this
> to happen as quickly as possible, as the archive operation appears to
> really tax the db server...

Have you considered partitioning?

http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html

If you can partition your timeblock table so that you archive an entire partition at a time, then you can delete the archived rows by just dropping (or truncating) that partition. AFAIK there's no way to re-parent a partition (e.g., from the timeblock table to the timeblock_archive table).

If your app is particularly cooperative, you might be able to use partitioning to avoid moving data around entirely. If table accesses are always qualified by something you can use as a partitioning key, then partitioning can give you the speed benefits of a small table without the effort of keeping it cleared out.

Another good read, if you haven't seen it yet, is

http://powerpostgresql.com/Downloads/annotated_conf_80.html

especially the Memory, Checkpoints, and maybe WAL sections.

If you're doing large deletes, you may also need to increase your free space map settings. If a VACUUM VERBOSE finishes by saying that you need more FSM pages, then the table may have gotten bloated over time. That can be fixed with a configuration change and a VACUUM FULL, though VACUUM FULL will lock everything else out of the table while it runs.

Mitch

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
   http://www.postgresql.org/docs/faq
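As a sketch of what that looks like in 8.1 (the column names and the monthly partitioning key here are invented, since I don't know your actual schema), partitioning is done with inheritance plus CHECK constraints:

```sql
-- Parent table: holds no rows itself; children hold the data.
CREATE TABLE timeblock (
    id         bigint NOT NULL,
    start_time timestamptz NOT NULL
    -- ... remaining columns ...
);

-- One child table per month, each constrained by the partitioning key.
CREATE TABLE timeblock_2005_11 (
    CHECK (start_time >= '2005-11-01' AND start_time < '2005-12-01')
) INHERITS (timeblock);

CREATE TABLE timeblock_2005_12 (
    CHECK (start_time >= '2005-12-01' AND start_time < '2006-01-01')
) INHERITS (timeblock);

-- With this on, queries qualified by start_time skip partitions whose
-- CHECK constraint rules them out.
SET constraint_exclusion = on;

-- "Archiving" a month is then nearly instantaneous metadata work,
-- instead of a DELETE over millions of rows:
DROP TABLE timeblock_2005_11;       -- or: TRUNCATE timeblock_2005_11;
```

Inserts either go directly into the correct child table or get routed there by a rule or trigger on the parent; the 8.1 docs linked above cover both approaches.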
Re: [PERFORM] Overriding the optimizer
On Thu, 2005-12-15 at 18:23 -0800, Craig A. James wrote:
> So, "you still have no problem" is exactly wrong, because Postgres
> picked the wrong plan. Postgres decided that applying myfunc() to
> 10,000,000 rows was a better plan than an index scan of 50,000
> row_nums. So I'm screwed.

FWIW, the cost_functionscan procedure in costsize.c has the following comment:

    /*
     * For now, estimate function's cost at one operator eval per function
     * call. Someday we should revive the function cost estimate columns in
     * pg_proc...
     */

I recognize that you're trying to talk about the issue in general rather than about this particular example. However, the example does seem to me to be exactly the case where the effort might be better spent improving the optimizer (reviving the function cost estimate columns) rather than implementing a general hinting facility. Which one is more effort? I don't really know for sure, but cost_functionscan does seem pretty straightforward.

What percentage of problems raised on this list can be fixed by setting configuration parameters, adding indexes, increasing statistics, or re-architecting a crazy schema? I've only been lurking for a few months, but it seems like a pretty large fraction. Of the remainder, what percentage represent actual useful feedback about what needs improvement in the optimizer? A pretty large fraction, I think, including your example.

Personally, I think whoever was arguing for selectivity hints on -hackers recently made a pretty good point, so I'm partly on your side. Actually, function cost hints don't really seem that much different from selectivity hints, and both seem to me to be slicker solutions (closer to the right level of abstraction) than a general hint facility.

Mitch
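To make the idea concrete, here is the shape a revived function cost estimate might take at the SQL level. This is hypothetical syntax, not anything that exists in current releases, and myfunc's signature is invented for illustration:

```sql
-- Hypothetical: tell the planner that each myfunc() call costs on the
-- order of 10000 operator evals instead of the hardwired 1, so that
-- scanning 10,000,000 rows and applying myfunc() no longer looks
-- cheaper than an index scan over 50,000 row_nums.
ALTER FUNCTION myfunc(integer) COST 10000;
```

With a per-function cost like that in pg_proc, cost_functionscan (and qual costing generally) could price the two plans honestly instead of guessing.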
Re: [PERFORM] Simple Join
On Thu, 2005-12-15 at 01:48 -0600, Kevin Brown wrote:
> > Well, I'm no expert either, but if there was an index on
> > ordered_products (paid, suspended_sub, id) it should be mergejoinable
> > with the index on to_ship.ordered_product_id, right? Given the
> > conditions on paid and suspended_sub.
>
> The following is already there:
>
> CREATE INDEX ordered_product_id_index
>   ON to_ship
>   USING btree (ordered_product_id);
>
> That's why I emailed this list.

I saw that; what I'm suggesting is that you try creating a 3-column index on ordered_products using the paid, suspended_sub, and id columns, in that order (though you could also try the reverse). It may or may not help, but it's worth a shot: the fact that all of those columns are used together in the query suggests that you might do better with a single three-column index than with the columns indexed individually.

With all three columns indexed individually, you're apparently not getting the bitmap plan that Mark is hoping for. I imagine this has to do with the lack of multi-column statistics in Postgres, though you could also try raising the statistics target on the columns of interest. Setting enable_seqscan to off, as others have suggested, is also a worthwhile experiment, just to see what you get.

Mitch
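In SQL, the experiment I'm suggesting looks like this (the index name and the statistics target of 100 are just my picks):

```sql
-- One composite index covering all three columns the query uses
-- together, instead of three single-column indexes.
CREATE INDEX ordered_products_paid_susp_id
    ON ordered_products (paid, suspended_sub, id);

-- Separately: give the planner better per-column statistics, then
-- re-ANALYZE so the new histograms are actually collected.
ALTER TABLE ordered_products ALTER COLUMN paid SET STATISTICS 100;
ALTER TABLE ordered_products ALTER COLUMN suspended_sub SET STATISTICS 100;
ANALYZE ordered_products;
```

Then re-run the EXPLAIN ANALYZE and compare; if the planner still prefers the old plan, the enable_seqscan = off experiment will at least show what the alternative plan would have cost.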
Re: [PERFORM] Database restore speed
On Fri, 2005-12-02 at 23:03 -0500, Luke Lonergan wrote:
> And how do we compose the binary data on the client? Do we trust that
> the client encoding conversion logic is identical to the backend's?

Well, my newbieness is undoubtedly showing already, so I might as well continue with my line of dumb questions. I did a little mail-archive searching, but had a hard time coming up with unique query terms.

This is a slight digression, but my question about binary-format query results wasn't rhetorical: do I have to worry about different platforms when I'm getting binary RowData(s) back from the server? Or when I'm sending binary Bind messages?

Regarding whether or not the client has identical encoding/conversion logic, how about a fast path that starts out by checking for compatibility? In addition to a BOM, you could add a float format mark that was an array of things like +0.0, -0.0, min, max, +Inf, -Inf, NaN, etc. It looks like XDR specifies byte order for floats and otherwise punts to IEEE.

I have no experience with SQL*Loader, but a quick read of the docs suggests that it divides data types into portable and nonportable groups, where loading nonportable data types requires extra care.

This may be overkill, but have you looked at HDF5? Only one hit came up in the mail archives.

http://hdf.ncsa.uiuc.edu/HDF5/doc/H5.format.html

For (e.g.) floats, the format includes metadata that specifies byte order, padding, normalization, the location of the sign, exponent, and mantissa, and the size of the exponent and mantissa. The format appears not to require length information on a per-datum basis. A cursory look at the data format page gives me the impression that there's a useful streamable subset.

The license of the implementation is BSD-style (no advertising clause), and it appears to support a large variety of platforms. Currently, the format spec only mentions ASCII, but since the library doesn't do any actual string manipulation (just storage and retrieval, AFAICS) it may be UTF-8 clean.

Mitch
Re: [PERFORM] Database restore speed
On Fri, 2005-12-02 at 13:24 -0800, Luke Lonergan wrote:
> It's a matter of safety and generality - in general you can't be sure
> that client machines / OS'es will render the same conversions that the
> backend does in all cases IMO.

Can't binary values be safely sent cross-platform in DataRow messages? At least from my ignorant, cursory look at printtup.c, there's a binary-format code path: float4send in utils/adt/float.c uses pq_sendfloat4. I obviously haven't followed the entire rabbit trail, but it seems like it happens.

IOW, why isn't there a cross-platform issue when sending binary data from the backend to the client in query results? And if there isn't a problem there, why can't binary data be sent from the client to the backend?

Mitch
Re: [PERFORM] same plan, add 1 condition, 1900x slower
On Thu, 2005-11-10 at 12:23 -0500, Tom Lane wrote:
> Apparently, you are using a platform and/or locale in which strcoll()
> is spectacularly, god-awfully slow --- on the order of 10 msec per
> comparison.

The version with the condition is definitely doing more I/O; the version without the condition doesn't read at all. I strace'd an explain analyze for each separately, and this is what I ended up with (the first is with the condition, the second is without):

bash-2.05b$ cut '-d(' -f1 subsourcestrace | sort | uniq -c
   7127 gettimeofday
  75213 _llseek
      1 Process 30227 attached - interrupt to quit
      1 Process 30227 detached
 148671 read
      2 recv
      4 semop
      4 send

bash-2.05b$ cut '-d(' -f1 subsourcestrace-nocond | sort | uniq -c
   9103 gettimeofday
      7 _llseek
      1 Process 30227 attached - interrupt to quit
      1 Process 30227 detached
      2 recv
      4 send

For the moment, all of the rows in the view I'm selecting from satisfy the condition, so the output of both queries is the same. The relevant rows of the underlying tables are probably pretty contiguous (all of the rows satisfying the condition and the join were inserted at the same time). Could it just be the result of a weird physical distribution of data in the table/index files? For the fast query, the actual number of rows is a lot less than the planner expects. This is a bit hard to believe, but I can't make sense of those numbers any other way.

> What is the platform exactly, and what database locale and encoding
> are you using?

It's RHEL 3 on x86:

[EMAIL PROTECTED] root]# uname -a
Linux rehoboam 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686 i686 i386 GNU/Linux

The glibc version is 2.3.2.

statgen=# select current_setting('lc_collate');
 current_setting
-----------------
 en_US.UTF-8

Not sure what's relevant, but here's some more info: the machine has 4.5GiB of RAM and a 5-disk RAID 5. It's a dual Xeon 3.2GHz.

           relname           | relpages |  reltuples
-----------------------------+----------+-------------
 external_id_map             |   126883 | 1.55625e+07
 external_id_map_primary_key |    64607 | 1.55625e+07
 subject                     |       31 |        1186
 subject_pkey                |       19 |        1186

I've attached the output of select name, setting from pg_settings. And, in case my original message isn't handy, the explain analyze output and table/view info is below.

Thanks for taking a look,
Mitch

statgen=# explain analyze select * from subject_source;
Merge Join  (cost=0.00..330.72 rows=1186 width=46) (actual time=0.051..8.890 rows=1186 loops=1)
  Merge Cond: ("outer".id = "inner".target_id)
  ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36 rows=1186 width=28) (actual time=0.022..1.441 rows=1186 loops=1)
  ->  Index Scan using external_id_map_primary_key on external_id_map eim  (cost=0.00..2485226.70 rows=15562513 width=26) (actual time=0.016..2.532 rows=2175 loops=1)
Total runtime: 9.592 ms
(5 rows)

statgen=# explain analyze select * from subject_source where source='SCH';
Merge Join  (cost=0.00..1147.33 rows=1 width=46) (actual time=0.054..20258.161 rows=1186 loops=1)
  Merge Cond: ("outer".id = "inner".target_id)
  ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36 rows=1186 width=28) (actual time=0.022..1.478 rows=1186 loops=1)
  ->  Index Scan using external_id_map_primary_key on external_id_map eim  (cost=0.00..2524132.99 rows=2335 width=26) (actual time=0.022..20252.326 rows=1186 loops=1)
        Filter: (source = 'SCH'::bpchar)
Total runtime: 20258.922 ms
(6 rows)

statgen=# \d subject_source
          View "public.subject_source"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 id        | bigint                |
 sex       | integer               |
 parent1   | bigint                |
 parent2   | bigint                |
 source    | character(3)          |
 source_id | character varying(32) |
View definition:
 SELECT norm.id, norm.sex, norm.parent1, norm.parent2, eim.source, eim.source_id
   FROM subject norm
   JOIN util.external_id_map eim ON norm.id = eim.target_id;

statgen=# \d subject
        Table "public.subject"
 Column  |  Type   | Modifiers
---------+---------+-----------
 id      | bigint  | not null
 sex     | integer |
 parent1 | bigint  |
 parent2 | bigint  |
Indexes:
    "subject_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "subject_parent1" FOREIGN KEY (parent1) REFERENCES subject(id) DEFERRABLE INITIALLY DEFERRED
    "subject_parent2" FOREIGN KEY (parent2) REFERENCES subject(id)
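If the expense really is locale-aware comparison, one experiment worth trying is an index built with the pattern-ops operator class, which compares bytewise and bypasses strcoll(). This is only a sketch: whether the planner can use such an index depends on the operators in the query (it helps LIKE 'prefix%'-style and pattern-operator conditions, not necessarily plain equality through the default opclass):

```sql
-- bpchar_pattern_ops compares byte-by-byte instead of calling the
-- locale's strcoll(), sidestepping a slow en_US.UTF-8 collation.
CREATE INDEX external_id_map_source_pattern
    ON util.external_id_map (source bpchar_pattern_ops);
```

The more drastic alternative, if locale-aware ordering isn't needed anywhere in the database, is to initdb with the C locale, which makes all text comparisons bytewise.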
[PERFORM] same plan, add 1 condition, 1900x slower
This is with Postgres 8.0.3. Any advice is appreciated. I'm not sure exactly what I expected, but I was hoping that if it used the external_id_map_source_target_id index it would be faster. Mainly I was surprised that the same plan could perform so differently with just one extra condition. I've increased the statistics target on util.external_id_map.source, but I'm fuzzy on exactly which columns the planner could use more information about.

statgen=# explain analyze select * from subject_source;
Merge Join  (cost=0.00..316.79 rows=1186 width=46) (actual time=0.136..9.808 rows=1186 loops=1)
  Merge Cond: ("outer".id = "inner".target_id)
  ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36 rows=1186 width=28) (actual time=0.050..1.834 rows=1186 loops=1)
  ->  Index Scan using external_id_map_primary_key on external_id_map eim  (cost=0.00..2345747.01 rows=15560708 width=26) (actual time=0.061..2.944 rows=2175 loops=1)
Total runtime: 10.745 ms
(5 rows)

statgen=# explain analyze select * from subject_source where source='SCH';
Merge Join  (cost=0.00..640.95 rows=1 width=46) (actual time=0.043..21074.403 rows=1186 loops=1)
  Merge Cond: ("outer".id = "inner".target_id)
  ->  Index Scan using subject_pkey on subject norm  (cost=0.00..63.36 rows=1186 width=28) (actual time=0.014..1.478 rows=1186 loops=1)
  ->  Index Scan using external_id_map_primary_key on external_id_map eim  (cost=0.00..2384648.78 rows=4150 width=26) (actual time=0.020..21068.508 rows=1186 loops=1)
        Filter: (source = 'SCH'::bpchar)
Total runtime: 21075.142 ms
(6 rows)

statgen=# \d subject
        Table "public.subject"
 Column  |  Type   | Modifiers
---------+---------+-----------
 id      | bigint  | not null
 sex     | integer |
 parent1 | bigint  |
 parent2 | bigint  |
Indexes:
    "subject_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "subject_parent1" FOREIGN KEY (parent1) REFERENCES subject(id) DEFERRABLE INITIALLY DEFERRED
    "subject_parent2" FOREIGN KEY (parent2) REFERENCES subject(id) DEFERRABLE INITIALLY DEFERRED
    "subject_id_map" FOREIGN KEY (id) REFERENCES util.external_id_map(target_id) DEFERRABLE INITIALLY DEFERRED

statgen=# \d subject_source
          View "public.subject_source"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 id        | bigint                |
 sex       | integer               |
 parent1   | bigint                |
 parent2   | bigint                |
 source    | character(3)          |
 source_id | character varying(32) |
View definition:
 SELECT norm.id, norm.sex, norm.parent1, norm.parent2, eim.source, eim.source_id
   FROM subject norm
   JOIN util.external_id_map eim ON norm.id = eim.target_id;

statgen=# \d util.external_id_map
      Table "util.external_id_map"
  Column   |         Type          | Modifiers
-----------+-----------------------+-----------
 source_id | character varying(32) | not null
 source    | character(3)          | not null
 target_id | bigint                | not null
Indexes:
    "external_id_map_primary_key" PRIMARY KEY, btree (target_id)
    "external_id_map_source_source_id_unique" UNIQUE, btree (source, source_id)
    "external_id_map_source" btree (source)
    "external_id_map_source_target_id" btree (source, target_id)
Foreign-key constraints:
    "external_id_map_source" FOREIGN KEY (source) REFERENCES util.source(id)

Thanks in advance,
Mitch
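For reference, the statistics-target knob mentioned above is per-column; the targets of 500 here are arbitrary picks (the 8.0 maximum is 1000), and target_id is included on the guess that the merge-join estimate matters too:

```sql
-- Collect finer-grained statistics (more MCVs, larger histograms)
-- on the filter and join columns, then re-ANALYZE so the planner
-- actually sees them.
ALTER TABLE util.external_id_map ALTER COLUMN source SET STATISTICS 500;
ALTER TABLE util.external_id_map ALTER COLUMN target_id SET STATISTICS 500;
ANALYZE util.external_id_map;
```

After the ANALYZE, re-running the two EXPLAIN ANALYZEs will show whether the rows=4150 estimate on the filtered index scan moves closer to the actual 1186.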