Re: [PERFORM] Tuning
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-performance-[EMAIL PROTECTED] On Behalf Of John Parnefjord
> Sent: Tuesday, January 30, 2007 2:05 AM
> Subject: Re: [PERFORM] Tuning
>
> EnterpriseDB advocates: 250 KB + 8.2 KB * shared_buffers + 14.2 kB * max_connections up to infinity
> [http://www.enterprisedb.com/documentation/kernel-resources.html]

... plus 8.1 kB * wal_buffers + 6 * max_fsm_pages + 65 * max_fsm_relations. Okay, maybe getting pedantic; but if you're going to cite the ~256 KB constant overhead ... :-)

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
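For a concrete number, the same arithmetic can be run as a quick query. The parameter values below (shared_buffers = 16000, max_connections = 100, wal_buffers = 2000, max_fsm_pages = 1000000, max_fsm_relations = 1000) are only assumed examples, and the two FSM terms are treated as bytes, which is an assumption since the post doesn't give units:

    SELECT round( 250                         -- constant overhead, kB
                + 8.2  * 16000                -- shared_buffers
                + 14.2 * 100                  -- max_connections
                + 8.1  * 2000                 -- wal_buffers
                + 6    * 1000000 / 1024.0     -- max_fsm_pages, bytes -> kB
                + 65   * 1000    / 1024.0     -- max_fsm_relations, bytes -> kB
                ) AS approx_shared_memory_kb;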
Re: [HACKERS] [PERFORM] Hints proposal
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bucky Jordan Sent: Thursday, October 12, 2006 2:19 PM To: josh@agliodbs.com; Jim C. Nasby Cc: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org Subject: Re: [HACKERS] [PERFORM] Hints proposal > > Well, one nice thing about the per-query method is you can post before > > and after EXPLAIN ANALYZE along with the hints. > > One bad thing is that application designers will tend to use the hint, fix > the immediate issue, and never report a problem at all. And query hints > would not be collectable in any organized way except the query log, which > would then require very sophisticated text parsing to get any useful > information at all. > Or they'll report it when the next version of Postgres "breaks" their app because the hints changed, or because the planner does something else which makes those hints obsolete. My main concern with hints (aside from the fact I'd rather see more intelligence in the planner/stats) is managing them appropriately. I have two general types of SQL where I'd want to use hints- big OLAP stuff (where I have a lot of big queries, so it's not just one or two where I'd need them) or large dynamically generated queries (Users building custom queries). Either way, I don't want to put them on a query itself. What about using regular expressions, plus, if you have a function (views, or any other statement that is stored), you can assign a rule to that particular function. So you get matching, plus explicit selection. This way it's easy to find all your hints, turn them off, manage them, etc. (Not to mention dynamically generated SQL is ugly enough without having to put hints in there). - Bucky ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Bad plan for join to aggregate of join.
Tom Lane wrote:
> Mischa Sandberg <[EMAIL PROTECTED]> writes:
>> ... can PG see that a join on a grouped-by field can be pushed down into the query as an indexable filter?
>
> No. The GROUP BY serves as a partial optimization fence. If you're concerned about the speed of this query, I recommend making a different view in which 'message' is joined inside the GROUP BY.

Thanks.

---(end of broadcast)---
TIP 6: explain analyze is your friend
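A sketch of the kind of view Tom is suggesting, hedged because the thread never shows the original view definition: the aggregate is reduced to a count(*) stand-in, and the only point is that "message" (and therefore the m_global_id filter on it) sits inside the GROUP BY rather than being joined to it afterwards:

    CREATE VIEW message_all_recipients_v2 AS
    SELECT  m.id AS m_id,
            count(*) AS m_all_recipients      -- stand-in for the real aggregate
    FROM    message m
    JOIN    message_recipient mr ON mr.message_id = m.id
    JOIN    recipient r          ON r.id = mr.recipient_id
    WHERE   mr.is_mapped = 1
    GROUP BY m.id;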
[PERFORM] Bad plan for join to aggregate of join.
PG 8.0.3 is choosing a bad plan for a query. I'm going to force the plan (by making one join into a function). I'd like to know if this is unexpected; in general, can PG see that a join on a grouped-by field can be pushed down into the query as an indexable filter?

The query below joins a table "message" to an aggregate of "message_recipient" joined to "recipient". The joins are all on indexed PK-FK columns. "message_recipient" is an intersect table.

    message :<: message_recipient :>: recipient

In the query plan below, the right side of the join returns one row of "message", and PG knows it. The left side of the join computes the entire aggregate of "message_recipient" (est 700K rows), then does a merge join against the single message row. I would have hoped for a nested-loop join, where the message "id" field would be used to index-scan "message_recipient", which in turn would index-scan "recipient" by recipient "id".

This is PG 8.0.3. All tables have been (very) recently analyzed. The query plan's estimated rowcounts all look bang-on. "message" and "message_recipient" are tables of about 3M rows each. As usual, this is on a system to which I only have restricted access. But I'd be happy to expand on the info below with anything short of the pg_dump.

---
EXPLAIN
SELECT  message.id AS m_db_id, message.m_global_id AS id, m_global_id, m_queue_id, h_message_id,
        m_date AS c_date_iso, m_date, c_subject_utf8,
        message.reason_id AS reason_id, m_reason.name AS m_reason,
        m_spam_probability, m_spam_level, h_to, m_message_size, m_header_size,
        date_part('epoch', message.m_date) AS c_qdate_time,
        h_from_local || '@' || h_from_domain AS h_from,
        env_from_local || '@' || env_from_domain AS env_from,
        env_from_local || '@' || env_from_domain AS m_envelope_from,
        location_name AS location, m_milter_host, m_relay,
        virus_name AS m_virus_name, m_all_recipients
FROM    message
JOIN    m_reason               ON message.reason_id = m_reason.reason_id
JOIN    message_all_recipients ON message.id = message_all_recipients.m_id
WHERE   message.m_global_id = '2211000-1';

                                    QUERY PLAN
---------------------------------------------------------------------------------
 Nested Loop  (cost=254538.42..283378.44 rows=1 width=425)
   Join Filter: ("outer".reason_id = "inner".reason_id)
   ->  Merge Join  (cost=254538.42..283377.33 rows=1 width=416)
         Merge Cond: ("outer".m_id = "inner".id)
         ->  Subquery Scan message_all_recipients  (cost=254535.40..281604.95 rows=707735 width=40)
               ->  GroupAggregate  (cost=254535.40..274527.60 rows=707735 width=36)
                     ->  Sort  (cost=254535.40..258250.57 rows=1486069 width=36)
                           Sort Key: message_recipient.message_id
                           ->  Merge Join  (cost=0.00..78970.52 rows=1486069 width=36)
                                 Merge Cond: ("outer".id = "inner".recipient_id)
                                 ->  Index Scan using pk_recipient on recipient  (cost=0.00..5150.65 rows=204514 width=36)
                                 ->  Index Scan using pk_message_recipient on message_recipient  (cost=0.00..56818.25 rows=1486069 width=16)
                                       Filter: (is_mapped = 1)
         ->  Sort  (cost=3.02..3.03 rows=1 width=384)
               Sort Key: message.id
               ->  Index Scan using unq_message_m_global_id on message  (cost=0.00..3.01 rows=1 width=384)
                     Index Cond: ((m_global_id)::text = '2211000-1'::text)
   ->  Seq Scan on m_reason  (cost=0.00..1.04 rows=4 width=13)
---

Relevant tables and view:

# \d message
                                 Table "public.message"
       Column        |            Type             |                Modifiers
---------------------+-----------------------------+----------------------------------------------------------
 id                  | bigint                      | not null default nextval('public.message_id_seq'::text)
 m_global_id         | character varying(255)      | not null
 reason_id           | smallint                    | not null
 location_name       | character varying(255)      | not null
 m_date              | timestamp without time zone |
 m_queue_id          | character varying(255)      |
 h_message_id        | character varying(255)      |
 c_subject_utf8      | character varying(255)      |
 env_from_local      | character varying(255)      |
 env_from_domain     | character varying(255)      |
 h_from_local        | character varying(255)      |
 h_from_domain       | character varying(255)      |
 h_from              | character varying(255)      |
 h_to                | character varying(255)      |
 m_milter_host       | character varying(255)      |
 m_relay             | character varying(255)      |
 m_spam_probability  | double precision            |
 m_message_size      | integer                     |
 m_header_size       |
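For reference, a hedged sketch of the "make one join into a function" workaround mentioned at the top of this message. The real aggregate behind message_all_recipients isn't shown, so a count() stands in for it, and the argument is referenced as $1 since 8.0 SQL functions don't have named parameters:

    CREATE FUNCTION recipients_for(bigint) RETURNS bigint AS '
        SELECT count(r.id)
        FROM   message_recipient mr
        JOIN   recipient r ON r.id = mr.recipient_id
        WHERE  mr.is_mapped = 1
          AND  mr.message_id = $1
    ' LANGUAGE sql STABLE;

    SELECT message.id, recipients_for(message.id) AS recipient_count
    FROM   message
    WHERE  m_global_id = '2211000-1';

This keeps the per-message index scans on pk_message_recipient and recipient, instead of materializing the whole 700K-row aggregate first.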
Re: [PERFORM] Optimizer internals
Mark Lewis wrote:
> On Thu, 2006-06-15 at 14:05 -0400, John Vincent wrote:
>> Now I've been told by our DBA that we should have been able to wholly satisfy that query via the indexes.
>
> DB2 can satisfy the query using only indexes because DB2 doesn't do MVCC.

You can get pretty much the same effect with materialized views. Create a table that LOOKS like the index (just those columns), with a foreign key relationship to the original table (cascade delete), and have the after-insert trigger on the main table write a row to the derived table. Now (index and) query the skinny table.

Advantage of these tables: you can cluster them regularly, because it doesn't hard-lock the main table.

--
Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection.

---(end of broadcast)---
TIP 6: explain analyze is your friend
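A minimal sketch of that skinny-table idea, with invented names (big_table, plus the two columns the "index" is meant to cover). As described it only handles INSERTs; deletes ride on the cascading FK, and updates to the covered columns would need their own trigger:

    CREATE TABLE big_table_skinny (
        id   bigint NOT NULL REFERENCES big_table (id) ON DELETE CASCADE,
        col1 integer,
        col2 date
    );
    CREATE INDEX big_table_skinny_idx ON big_table_skinny (col1, col2);

    CREATE FUNCTION big_table_skinny_ins() RETURNS trigger AS '
    BEGIN
        INSERT INTO big_table_skinny (id, col1, col2)
        VALUES (NEW.id, NEW.col1, NEW.col2);
        RETURN NEW;
    END' LANGUAGE plpgsql;

    CREATE TRIGGER big_table_ai AFTER INSERT ON big_table
        FOR EACH ROW EXECUTE PROCEDURE big_table_skinny_ins();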
Re: [PERFORM] Solaris shared_buffers anomaly?
Tom Lane wrote: Mischa Sandberg <[EMAIL PROTECTED]> writes: Tom Lane wrote: Does Solaris have any call that allows locking a shmem segment in RAM? Yes, mlock(). But want to understand what's going on before patching. Sure, but testing it with mlock() might help you understand what's going on, by eliminating one variable: we don't really know if the shmem is getting swapped, or something else. For a dedicated DB server machine, Solaris has a feature: create "intimate" shared memory with shmat(..., SHM_SHARE_MMU). All backends share the same TLB entries (!). We use that already. (Hmm, might be interesting for you to turn it *off* and see if anything changes. See src/backend/port/sysv_shmem.c.) Gah. Always must remember to RTFSource. And reproduce the problem on a machine I control :-) -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Solaris shared_buffers anomaly?
Jim C. Nasby wrote: On Tue, Jun 13, 2006 at 04:20:34PM -0700, Mischa Sandberg wrote: Jim C. Nasby wrote: What's sort_mem set to? I suspect you simply ran the machine out of memory. 8192 (8MB). No issue when shared_buffers was 2000; same apps always. So if all 50 backends were running a sort, you'd use 400MB. The box has 4G, right? Umm ... yes. "if". 35-40 of them are doing pure INSERTS. Not following your train. Yep, tested /etc/system segmap_percent at 20,40,60. No significant difference between 20 and 60. That's pretty disturbing... how large is your database? ~10GB. Good locality. Where heading? -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] Solaris shared_buffers anomaly?
Jim C. Nasby wrote: On Tue, Jun 13, 2006 at 03:21:34PM -0700, Mischa Sandberg wrote: Raised shared_buffers to 16000 (128MB). DB server dropped to a CRAWL. vmstat showed that it was swapping like crazy. Dropped shared_buffers back down again. Swapping stopped. What's sort_mem set to? I suspect you simply ran the machine out of memory. 8192 (8MB). No issue when shared_buffers was 2000; same apps always. Also, Solaris by default will only use a portion of memory for filesystem caching, which will kill PostgreSQL performance. Yep, tested /etc/system segmap_percent at 20,40,60. No significant difference between 20 and 60. Default is 10%? 12%? Can't recall. Was not changed from 20 during the shared_buffer test. -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Solaris shared_buffers anomaly?
Tom Lane wrote: Mischa Sandberg <[EMAIL PROTECTED]> writes: vmstat showed that it was swapping like crazy. Dropped shared_buffers back down again. Swapping stopped. Does Solaris have any call that allows locking a shmem segment in RAM? Yes, mlock(). But want to understand what's going on before patching. No reason to believe that the multiply-attached shm seg was being swapped out (which is frankly insane). Swapping out (and in) just the true resident set of every backend would be enough to explain the vmstat io we saw. http://www.carumba.com/talk/random/swol-09-insidesolaris.html For a dedicated DB server machine, Solaris has a feature: create "intimate" shared memory with shmat(..., SHM_SHARE_MMU). All backends share the same TLB entries (!). Context switch rates on our in-house solaris boxes running PG have been insane (4000/sec). Reloading the TLB map on every process context switch might be one reason Solaris runs our db apps at less than half the speed of our perftesters' Xeon beige-boxes. That's guesswork. Sun is making PG part of their distro ... perhaps they've some knowledgeable input. -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[PERFORM] Solaris shared_buffers anomaly?
Jim C. Nasby wrote: ... Actually, in 8.1.x I've seen some big wins from greatly increasing the amount of shared_buffers, even as high as 50% of memory, thanks to the changes made to the buffer management code. ... Anyone else run into a gotcha that one of our customers ran into? PG 7.4.8 running on Solaris 2.6, USparc w 4GB RAM. Usually about 50 active backends. (No reason to believe this wouldn't apply to 8.x). Initially shared_buffers were set to 1000 (8MB). Then, we moved all apps but the database server off the box. Raised shared_buffers to 2000 (16MB). Modest improvement in some frequent repeated queries. Raised shared_buffers to 16000 (128MB). DB server dropped to a CRAWL. vmstat showed that it was swapping like crazy. Dropped shared_buffers back down again. Swapping stopped. Stared at "ps u" a lot, and realized that the shm seg appeared to be counted as part of the resident set (RSS). Theory was that the kernel was reading the numbers the same way, and swapping out resident sets, since they obviously wouldn't all fit in RAM :-) Anyone from Sun reading this list, willing to offer an opinion? ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] vacuuming problems continued
Joshua D. Drake wrote: - in our env, clients occasionally hit max_connections. This is a known and (sort of) desired pushback on load. However, that sometimes knocks pg_autovacuum out. That is when you use: superuser_reserved_connections Blush. Good point. Though, when we hit max_connections on 7.4.8 systems, it's been a lemonade-from-lemons plus that vacuuming didn't fire up on top of everything else :-) - db server goes down for any reason: same problem. I believe you can use stats_reset_on_server_start = on We do. The problem is not the loss of pg_stat_user_tables.(n_tup_ins,...) It's the loss of pg_autovacuum's CountAtLastVacuum (and ...Analyze) numbers, which are kept in process memory. Never considered patching pg_autovacuum to just sleep and try again, rather than exit, on a failed db connection. -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] vacuuming problems continued
Andrew Sullivan wrote:
> On Thu, Jun 01, 2006 at 01:54:08PM +0200, Antoine wrote:
>> Hi, We just don't seem to be getting much benefit from autovacuum. Running a manual vacuum seems to still be doing a LOT, which suggests to me that I should either run a cron job and disable autovacuum, or just run a cron job on top of autovacuum.

Don't know if this was covered in an earlier thread. Bear with me if so.

I'm working with 7.4.8 and 8.0.3 systems, and pg_autovacuum does have some glitches ... in part solved by the integrated autovac in 8.1:

- in our env, clients occasionally hit max_connections. This is a known and (sort of) desired pushback on load. However, that sometimes knocks pg_autovacuum out.
- db server goes down for any reason: same problem.

Just restarting pg_autovacuum is not good enough; when pg_autovacuum terminates, it loses its state, so big tables that change less than 50% between such terminations may never get vacuumed (!)

For that reason, it's taken a switch to a Perl script run from cron every 5 minutes, that persists state in a table. The script is not a plug-compatible match for pg_autovacuum (hardcoded rates; hardcoded distinction between user and system tables), but you may find it useful.

--
Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Carp;
    use DBI;
    use POSIX qw(strftime);

    # Hardcoded (aggressive) autovacuum parameters:
    my ($VAC_BASE, $VAC_RATE) = (1000, 0.8);
    my ($ANA_BASE, $ANA_RATE) = ( 500, 0.4);
    my $VERBOSE = 'VERBOSE';

    my $start = time;
    my $stamp = strftime " %FT%T autovac: ", localtime;
    open STDERR, ">&STDOUT";    # Redirect PG "VERBOSE" output.

    my $dbh = DBI->connect("dbi:Pg(PrintError=1,RaiseError=1):");

    # REVISIT: move this to schema:
    my $oid = $dbh->selectall_arrayref(<<"__SQL__")->[0][0];
        SELECT oid FROM pg_class WHERE relname = 'autovac_state';
    __SQL__

    $dbh->do(<<"__SQL__") if !defined $oid;
        CREATE TABLE public.autovac_state(
            relid           oid NOT NULL PRIMARY KEY,
            name            text NOT NULL,
            analyze_time    timestamptz,
            vacuum_time     timestamptz,
            analyze_tups    bigint, -- (n_tup_ins+n_tup_upd+n_tup_del) at last ANALYZE
            vacuum_tups     bigint  -- (n_tup_upd+n_tup_del) at last VACUUM
        );
    __SQL__

    # Calculate updates/additions to autovac_state:
    my $anavac = join ";", map {$_->[0]} @{$dbh->selectall_arrayref(<<"__SQL__")};
        SELECT * INTO TEMP new_state
        FROM (
            SELECT  relid,        -- identify tables by ID, so that (re)created tables always
                                  -- are treated as fresh tables.
                    name,         -- for constructing the vacuum/analyze command
                    old_relid,    -- NULL means this will need a new state table entry
                    analyze_tups, -- _tups are used to update autovac_state
                    vacuum_tups,
                    CASE WHEN analyze_tups - prev_analyze_tups NOT BETWEEN 0 AND analyze_point
                              OR old_relid IS NULL THEN now() END AS analyze_time,
                    CASE WHEN vacuum_tups - prev_vacuum_tups NOT BETWEEN 0 AND vacuum_point
                              THEN now() END AS vacuum_time
            FROM (
                SELECT  N.nspname || '.' || C.relname AS name,
                        A.relid AS old_relid,
                        C.oid AS relid,
                        S.n_tup_ins + S.n_tup_upd + S.n_tup_del AS analyze_tups,
                        S.n_tup_upd + S.n_tup_del AS vacuum_tups,
                        COALESCE(A.analyze_tups,0) AS prev_analyze_tups,
                        COALESCE(A.vacuum_tups,0) AS prev_vacuum_tups,
                        CASE WHEN nspname ~ '^pg_' THEN 1.0 ELSE $ANA_RATE END * C.reltuples + $ANA_BASE AS analyze_point,
                        CASE WHEN nspname ~ '^pg_' THEN 2.0 ELSE $VAC_RATE END * C.reltuples + $VAC_BASE AS vacuum_point
                FROM    pg_class            AS C
                JOIN    pg_namespace        AS N ON N.oid = C.relnamespace
                JOIN    pg_stat_all_tables  AS S ON S.relid = C.oid
                LEFT JOIN autovac_state     AS A ON A.relid = S.relid
                WHERE   N.nspname NOT LIKE 'pg_temp%'
            ) AS X
        ) AS X
        WHERE analyze_time IS NOT NULL OR vacuum_time IS NOT NULL;

        SELECT CASE WHEN vacuum_time IS NOT NULL
                    THEN 'VACUUM ANALYZE $VERBOSE ' || name
                    ELSE 'ANALYZE $VERBOSE ' || name
               END
        FROM new_state;
    __SQL__

    if ($anavac) {
        print STDERR $stamp."start\n";
        $dbh->do(<<"__SQL__");
            $anavac;
            UPDATE autovac
Re: [PERFORM] Selects query stats?
Alvaro Herrera wrote: Yeah, the problem seems underspecified. So, Dan, the question is, what are you trying to measure? This might be a statistic that management has always been given, for Oracle, and you need to produce the "same" number for PostgreSQL. If not, it's hard to figure out what a statement counter actually can measure, to the extent that you can say, "If that number does THIS, I should do THAT." -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Selects query stats?
Dan Gorman wrote: What I am looking for is that our DB is doing X selects a min. What specifically would you like to measure? Duration for specific queries? Queries in an app for which you have no source? There may be a way to get what you want by other means ... Details? I gather you cannot just time the app that's doing the selects, nor extract those selects and run them via psql and time them on their own? Dan Gorman wrote: All, I might be completely crazy here, but it seems every other database exposes select query stats. Postgres only exposes updates/deletes/inserts. Is there something I am missing here? ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Selects query stats?
Dan Gorman wrote: All, I might be completely crazy here, but it seems every other database exposes select query stats. Postgres only exposes updates/deletes/inserts. Is there something I am missing here? Perhaps. You can EXPLAIN ANALYZE a SELECT, just like i/u/d -- but then you don't get the normal result set back. Is that what you mean? You can turn on log_min_duration_statement and get total SELECT duration logged. There's a thread in pgsql-hackers ("Re: Porting MSSQL to PGSQL: trace and profile") about server-side logging of query plans and stats (for all four of s/i/u/d), which is indeed not there in PG. -- Engineers think that equations approximate reality. Physicists think that reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
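For example (the threshold is arbitrary, and depending on the version log_min_duration_statement may have to be set by a superuser or in postgresql.conf rather than per session):

    SET log_min_duration_statement = 250;           -- log any statement slower than 250 ms
    EXPLAIN ANALYZE SELECT count(*) FROM my_table;  -- "my_table" is a placeholder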
Re: [PERFORM] Best hardware
Quoting Bernd Jagla <[EMAIL PROTECTED]>: > ... the speed of the head of the HD is actually > limitiing. Also, I only experimented with RAID5, and heard that > RAID10 will be good for reading but not writing. Au contraire. RAID5 is worse than RAID10 for writing, because it has the extra implicit read (parity stripe) for every write. I've switched all my perftest boxes over from RAID5 to RAID10, and the smallest performance increase was x1.6 . This is in an update-intensive system; the WAL log's disk write rate was the controlling factor. > Are Quad CPUs any good? I heard that the IBM quad system is supposed to be 40% > faster than HP or Dell???. Check out the other threads for negative experiences with Xeon 2x2 and perhaps quad CPU's. Me, I'm looking forward to my first Opteron box arriving next week. > And how much RAM should go for: are 8GB enough? Oh, of course I wanted to run it under RedHat... First off, you need enough RAM to hold all your connections. Run your app, watch the RSS column of "ps". For my own simpler apps (that pump data into the db) I allow 20MB/connection. Next, if you are heavy on inserts, your tables will never fit in RAM, and you really just need enough to hold the top levels of the indexes. Look at the disk space used in your $PGDATA/base// files, and you can work out whether holding ALL your indexes in memory is feasible. If you are heavy on updates, the above holds, but ymmv depending on locality of reference, you have to run your own tests. If you have concurrent big queries, all bets are off --- ask not how much RAM you need, but how much you can afford :-) ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
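A rough way to answer the "do my indexes fit in RAM" question from inside the database instead of poking around $PGDATA, assuming the default 8 kB block size and statistics fresh enough that relpages is current:

    SELECT relname, relpages * 8 / 1024 AS approx_mb
    FROM   pg_class
    WHERE  relkind = 'i'
    ORDER  BY relpages DESC
    LIMIT  20;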
Re: [PERFORM] PostgreSQL strugling during high load
Quoting Tom Lane <[EMAIL PROTECTED]>: > "Mindaugas Riauba" <[EMAIL PROTECTED]> writes: > > ... So contents of database changes very fast. Problem is that > when > > pg_autovacuum does vacuum those changes slows down too much. > > The "vacuum cost" parameters can be adjusted to make vacuums fired > by pg_autovacuum less of a burden. I haven't got any specific > numbers > to suggest, but perhaps someone else does. I solved one problem by cranking sleep scaling to -S 20. It made pg_autovacuum back off longer during extended periods of heavy disk-intensive query activity. Our update activity is near-constant insert rate, then once or twice a day, massive deletes. -- Dreams come true, not free. ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
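For manual vacuums the equivalent knobs are the cost-based delay settings (present as of 8.0); the values below are only illustrative, not recommendations:

    SET vacuum_cost_delay = 20;      -- sleep 20 ms each time the cost limit is reached
    SET vacuum_cost_limit = 200;     -- the default; lower it to back off more often
    VACUUM ANALYZE some_busy_table;  -- placeholder table name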
Re: [PERFORM] Bad plan after vacuum analyze
Quoting Guillaume Smet <[EMAIL PROTECTED]>:
> Hi,
>
> We have some performances problem on a particular query. ...

I have to say it, this was the best laid-out set of details behind a problem I've ever seen on this list; I'm going to try to live up to it the next time I have a problem of my own.

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Federated PG servers -- Was: Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Was curious why you pointed out SQL-MED as a SQL-standard approach to federated servers. Always thought of it as covering access to non-SQL data, the way the lo_* interface works; as opposed to meshing compatible (to say nothing of identical) SQL servers. Just checked Jim Melton's last word on that, to make sure, too. Is there something beyond that, that I'm missing? The approach that made first best sense to me (perhaps from having gone there before) is to leave the SQL syntactically unchanged, and to manage federated relations via pg_ tables and probably procedures. MSSQL and Sybase went that route. It won't preclude moving to a system embedded in the SQL language. The hurdles for federated SQL service are: - basic syntax (how to refer to a remote object) - connection management and delegated security - timeouts and temporary connection failures - efficient distributed queries with >1 remote table - distributed transactions - interserver integrity constraints Sometimes the lines get weird because of opportunistic implementations. For example, for the longest time, MSSQL supported server.db.user.object references WITHIN STORED PROCEDURES, since the proc engine could hide some primitive connection management. PG struck me as such a natural for cross-server queries, because it keeps everything out in the open, including statistics. PG is also well set-up to handle heterogeneous table types, and has functions that return rowsets. Nothing needs to be bent out of shape syntactically, or in the cross-server interface, to get over the hurdles above. The fact that queries hence transactions can't span multiple databases tells me, PG has a way to go before it can handle dependency on a distributed transaction monitor. My 2c. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Prefetch
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>: > > Another trick you can use with large data sets like this when you > want > > results > > back in seconds is to have regularly updated tables that aggregate > the data > > along each column normally aggregated against the main data set. > > > Maybe some bright person will prove me wrong by posting some > working > > information about how to get these apparently absent features > working. > > Most people just use simple triggers to maintain aggregate summary > tables... Don't know if this is more appropriate to bizgres, but: What the first poster is talking about is what OLAP cubes do. For big aggregating systems (OLAP), triggers perform poorly, compared to messy hand-rolled code. You may have dozens of aggregates at various levels. Consider the effect of having each detail row cascade into twenty updates. It's particularly silly-looking when data is coming in as batches of thousands of rows in a single insert, e.g. COPY temp_table FROM STDIN; UPDATE fact_table ... FROM ... temp_table INSERT INTO fact_table ...FROM...temp_table (the above pair of operations is so common, Oracle added its "MERGE" operator for it). Hence my recent post (request) for using RULES to aggregate --- given no luck with triggers "FOR EACH STATEMENT". ---(end of broadcast)--- TIP 8: explain analyze is your friend
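Spelled out, the batch pattern above looks roughly like this; the fact table, columns and temp table are invented, and a real system would fold in whatever dimension keys the cube needs:

    CREATE TEMP TABLE temp_sales (day date, store integer, amount numeric);
    COPY temp_sales FROM STDIN;                    -- bulk-load one batch of detail rows

    UPDATE sales_daily AS f                        -- roll the batch into existing aggregate rows
    SET    amount = f.amount + t.amount
    FROM  (SELECT day, store, sum(amount) AS amount
           FROM temp_sales GROUP BY day, store) AS t
    WHERE  f.day = t.day AND f.store = t.store;

    INSERT INTO sales_daily (day, store, amount)   -- then insert aggregate rows that don't exist yet
    SELECT t.day, t.store, sum(t.amount)
    FROM   temp_sales t
    LEFT   JOIN sales_daily f ON f.day = t.day AND f.store = t.store
    WHERE  f.day IS NULL
    GROUP  BY t.day, t.store;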
Re: [PERFORM] Partitioning / Clustering
Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>: > To the best of my knowledge no such work has been done. There is a > project (who's name escapes me) that lets you run queries against a > remote postgresql server from a postgresql connection to a different > server, which could serve as the basis for what you're proposing. Okay, if the following looks right to the powerthatbe, I'd like to start a project. Here's the proposition: "servername.dbname.schema.object" would change RangeVar, which would affect much code. "dbname.schema.object" itself is not implemented in 8.0. So, simplicity dictates something like: table pg_remote(schemaname text, connectby text, remoteschema text) The pg_statistic info from a remote server cannot be cached in local pg_statistic, without inventing pseudo reloids as well as a pseudoschema. Probably cleaner to cache it somewhere else. I'm still reading down the path that puts pg_statistic data where costsize can get at it. First step: find out whether one can link libpq.so to postmaster :-) ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Partitioning / Clustering
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>:
>>> *laff*
>>> Yeah, like they've been working on views for the last 5 years, and
>>> still haven't released them :D :D :D
>>
>> ?
>> http://dev.mysql.com/doc/mysql/en/create-view.html
>> ...for MySQL 5.0.1+ ?
>
> Give me a call when it's RELEASED. :-)

Touché

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Partitioning / Clustering
Quoting Christopher Kings-Lynne <[EMAIL PROTECTED]>: > > This is why I mention partitioning. It solves this issue by storing > > different data sets on different machines under the same schema. > > These seperate chunks of the table can then be replicated as well for > > data redundancy and so on. MySQL are working on these things > *laff* > Yeah, like they've been working on views for the last 5 years, and > still haven't released them :D :D :D ? http://dev.mysql.com/doc/mysql/en/create-view.html ...for MySQL 5.0.1+ ? ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Quoting Tom Lane <[EMAIL PROTECTED]>:
> Mischa Sandberg <[EMAIL PROTECTED]> writes:
>> The PG hash join is the simplest possible: build a hash table in memory, and match an input stream against it.
>
> [ raised eyebrow... ] Apparently you've not read the code. It's been hybrid hashjoin since we got it from Berkeley. Probably not the best possible implementation of the concept, but we do understand about spill to disk.

Apologies. I stopped reading around line 750 (PG 8.0.1) in src/backend/executor/nodeHashjoin.c:

    if (!node->hj_hashdone)
    {
        /*
         * execute the Hash node, to build the hash table
         */
        hashNode->hashtable = hashtable;
        (void) ExecProcNode((PlanState *) hashNode);
        ...

and missed the comment:

    /*
     * Open temp files for outer batches,
     */

Will quietly go and read twice, talk once.

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Partitioning / Clustering
Quoting Alex Stapleton <[EMAIL PROTECTED]>: > This is why I mention partitioning. It solves this issue by storing > different data sets on different machines under the same schema. > These seperate chunks of the table can then be replicated as well for > data redundancy and so on. MySQL are working on these things, but PG > just has a bunch of third party extensions, I wonder why these are > not being integrated into the main trunk :/ Thanks for pointing me to > PGCluster though. It looks like it should be better than Slony at > least. Across a decade or two of projects, including creating a federated database engine for Simba, I've become rather dubious of horizontal partitions (across disks or servers), either to improve performance, or just to scale up and not lose performance. [[The one exception is for non-time-critical read-only systems, with Slony-style replication.]] The most successful high-volume systems I've seen have broken up databases functionally, like a pipeline, where different applications use different sections of the pipe. The highest-volume system I've worked on is Acxiom's gigantic data-cleansing system. This is the central clearinghouse for every scrap of demographic that can be associated with some North American, somewhere. Think of D&B for 300M people (some dead). The volumes are just beyond belief, for both updates and queries. At Acxiom, the datasets are so large, even after partitioning, that they just constantly cycle them through memory, and commands are executes in convoys --- sort of like riding a paternoster. .. Anybody been tracking on what Mr Stonebraker's been up to, lately? Datastream management. Check it out. Like replication, everybody hand-rolled their own datastream systems until finally somebody else generalized it well enough that it didn't have to be built from scratch every time. Datastream systems require practically no locking, let alone distributed transactions. They give you some really strong guarantees on transaction elapsed-time and throughput. ... Where is this all leading? Well, for scaling data like this, the one feature that you need is the ability of procedures/rules on one server to perform queries/procedures on another. MSSQL has linked servers and (blech) OpenQuery. This lets you do reasonably-efficient work when you only deal with one table at a time. Do NOT try anything fancy with multi-table joins; timeouts are unavoidable, and painful. Postgres has a natural advantage in such a distributed server system: all table/index stats are openly available through the SQL interface, for one server to make rational query plans involving another server's resources. God! I would have killed for that when I was writing a federated SQL engine; the kluges you need to do this at arms-length from that information are true pain. So where should I go look, to see what's been done so far, on a Postgres that can treat another PG server as a new table type? ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>: > Well, in a hash-join right now you normally end up feeding at least > one > side of the join with a seqscan. Wouldn't it speed things up > considerably if you could look up hashes in the hash index instead? You might want to google on "grace hash" and "hybrid hash". The PG hash join is the simplest possible: build a hash table in memory, and match an input stream against it. *Hybrid hash* is where you spill the hash to disk in a well-designed way. Instead of thinking of it as building a hash table in memory, think of it as partitioning one input; if some or all of it fits in memory, all the better. The boundary condition is the same. The real wizard of hybrid hash has to be Goetz Graefe, who sadly has now joined the MS Borg. He demonstrated that for entire-table joins, hybrid hash completely dominates sort-merge. MSSQL now uses what he developed as an academic, but I don't know what the patent state is. "Grace hash" is the original implementation of hybrid hash: Kitsuregawa, M., Tanaka, H., and Moto-oka, T. (1984). Architecture and Performance of Relational Algebra Machine Grace. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Partitioning / Clustering
Quoting [EMAIL PROTECTED]: > > exploring the option of buying 10 cheapass > > machines for $300 each. At the moment, that $300 buys you, from > Dell, a > > 2.5Ghz Pentium 4 > > Buy cheaper ass Dells with an AMD 64 3000+. Beats the crap out of > the 2.5 > GHz Pentium, especially for PostgreSQL. Whence "Dells with an AMD 64" ?? Perhaps you skimmed: http://www.thestreet.com/tech/kcswanson/10150604.html or http://www.eweek.com/article2/0,1759,1553822,00.asp ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] [GENERAL] "Hash index" vs. "b-tree index" (PostgreSQL
Quoting "Jim C. Nasby" <[EMAIL PROTECTED]>: > I'm not really familiar enough with hash indexes to know if this > would > work, but if the maximum bucket size was known you could use that to > determine a maximum range of buckets to look at. In some cases, that > range would include only one bucket, otherwise it would be a set of > buckets. If you found a set of buckets, I think you could then just > go > to the specific one you need. > > If we assume that the maximum bucket size is one page it becomes > more > realistic to take an existing large bucket and split it into several > smaller ones. This could be done on an update to the index page, or > a > background process could handle it. > > In any case, should this go on the TODO list? > > > Allowing a bucket size to be specified at CREATE INDEX doesn't seem > out > > of line though. We'd have to think up a scheme for > index-AM-specific > > index parameters ... > -- > Jim C. Nasby, Database Consultant [EMAIL PROTECTED] > Give your computer some brain candy! www.distributed.net Team #1828 Google "dynamic hash" or "linear hash". It takes care of not needing to have varying bucket sizes. Hash indexes are useful if you ALWAYS require disk access; they behave like worst-case random cache-thrash tests. That's probably why dbs have gravitated toward tree indexes instead. On the other hand, there's more (good) to hashing than initially meets the eye. Dynamic multiway hashing has come a long way from just splicing the bits together from multiple columns' hash values. If you can lay your hands on Tim Merrett's old text "Relational Information Systems", it's an eye-opener. Picture an efficient terabyte spreadsheet. For one thing, unlike a btree, a multicolumn hash is symmetric: it doesn't matter which column(s) you do not specify in a partial match. For another, a multiway hash is useful for much lower selectivity than a btree. I built such indexes for OLAP cubes, and some dimensions were only 10 elements wide. At the point where btree indexing becomes worse than seqscan, a multiway hash tells you which 10% of the disk to scan. ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Query tuning help
Quoting Russell Smith <[EMAIL PROTECTED]>:
> On Mon, 9 May 2005 11:49 am, Dan Harris wrote:
>> On May 8, 2005, at 6:51 PM, Russell Smith wrote:
> [snip]
>> select distinct em.incidentid, ea.recordtext as retdata, eg.long, eg.lat
>> FROM em
>> JOIN ea ON em.incidentid = ea.incidentid    --- slight paraphrase /Mischa.
>>     AND em.entrydate between '2005-1-1' and '2005-5-9'
>>     AND ea.recordtext like '%RED%' AND ea.recordtext like '%CORVETTE%'
>>
>> Here's the situation:
>> Due to the format of the systems with which I integrate ( I have no
>> control over these formats ), we will get these 'recordtext' values one
>> line at a time, accumulating over time. The only way I can find to
>> make this work is to insert a new record for each line. The problem
>> is, that when someone wants to search multiple keywords, they expect
>> these words to be matched across multiple records with a given incident
>> number.
>>
>> For a very simple example:
>>
>> IncidentID    Date              Recordtext
>> ------------  ----------------  ------------------------------
>> 1             2005-05-01 14:21  blah blah blah RED blah blah
>>               2005-05-01 14:23  not what we are looking for
>> 1             2005-05-02 02:05  blah CORVETTE blah blah

    select em.incidentid, ea.recordtext as retdata
    from em
    join ea using (incidentid)
    join (
        -- equivalent to "where incidentid in (...)", sometimes faster.
        select incidentid
        from em join ea using (incidentid)
        where em.entrydate between '2005-1-1' and '2005-5-9'
        group by incidentid
        having 1 = min(case when recordtext like '%RED%' then 1 end)
           and 1 = min(case when recordtext like '%CORVETTE%' then 1 end)
    ) as X using (incidentid);

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Whence the Opterons?
Thanks to everyone for their pointers to suppliers of Opteron systems. The system I'm pricing is under a tighter budget than a production machine, because it will be for perftests. Our customers tend to run on Dells but occasionally run on (Sun) Opterons. ---(end of broadcast)--- TIP 8: explain analyze is your friend
[PERFORM] Whence the Opterons?
After reading the comparisons between Opteron and Xeon processors for Linux, I'd like to add an Opteron box to our stable of Dells and Sparcs, for comparison. IBM, Sun and HP have their fairly pricey Opteron systems. The IT people are not swell about unsupported purchases off ebay. Anyone care to suggest any other vendors/distributors? Looking for names with national support, so that we can recommend as much to our customers. Many thanks in advance. -- "Dreams come true, not free." -- S.Sondheim ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] COPY vs INSERT
Quoting Kris Jurka <[EMAIL PROTECTED]>: > On Wed, 4 May 2005, Mischa Sandberg wrote: > > > Copy makes better use of the TCP connection for transmission. COPY > uses > > the TCP connection like a one-way pipe. INSERT is like an RPC: the > > sender has to wait until the insert's return status roundtrips. > > Not true. A client may send any number of Bind/Execute messages on a > prepared statement before a Sync message. So multiple inserts may be > sent > in one network roundtrip. This is exactly how the JDBC driver > implements batch statements. There is some limit to the number of > queries > in flight at any given moment because there is the potential to > deadlock > if both sides of network buffers are filled up and each side is > blocked > waiting on a write. The JDBC driver has conservatively selected 256 > as > the maximum number of queries to send at once. Hunh. Interesting optimization in the JDBC driver. I gather it is sending a string of (;)-separated inserts. Sounds like efficient-but-risky stuff we did for ODBC drivers at Simba ... gets interesting when one of the insert statements in the middle fails. Good to know. Hope that the batch size is parametric, given that you can have inserts with rather large strings bound to 'text' columns in PG --- harder to identify BLOBs when talking to PG, than when talking to MSSQL/Oracle/Sybase. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] COPY vs INSERT
Quoting David Roussel <[EMAIL PROTECTED]>: > > COPY invokes all the same logic as INSERT on the server side > > (rowexclusive locking, transaction log, updating indexes, rules). > > The difference is that all the rows are inserted as a single > > transaction. This reduces the number of fsync's on the xlog, > > which may be a limiting factor for you. You'll want to crank > > WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. > > One of my streams has 6K records; I run with WB=1000, CS=128. > > So what's the difference between a COPY and a batch of INSERT > statements. Also, surely, fsyncs only occur at the end of a > transaction, no need to fsync before a commit has been issued, > right? Sorry, I was comparing granularities the other way araound. As far as xlog is concerned, a COPY is ALWAYS one big txn, no matter how many putline commands you use to feed the copy. With inserts, you can choose whether to commit every row, every nth row, etc. Copy makes better use of the TCP connection for transmission. COPY uses the TCP connection like a one-way pipe. INSERT is like an RPC: the sender has to wait until the insert's return status roundtrips. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
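In other words (table and values are placeholders), COPY is one transaction no matter how many rows it carries, while the INSERT version picks its own commit points:

    COPY mytable FROM STDIN;               -- one transaction, one WAL flush at commit

    BEGIN;
    INSERT INTO mytable VALUES (1, 'a');
    INSERT INTO mytable VALUES (2, 'b');
    -- ... up to whatever batch size you choose, then:
    COMMIT;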
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Josh Berkus:
> Mischa,
>
>> Okay, although given the track record of page-based sampling for
>> n-distinct, it's a bit like looking for your keys under the streetlight,
>> rather than in the alley where you dropped them :-)
>
> Bad analogy, but funny.

Bad analogy? Page-sampling effort versus row-sampling effort, c'est moot. It's not good enough for stats to produce good behaviour on the average. Straight random sampling, page or row, is going to cause enough untrustworthy engine behaviour, for any %ages small enough to allow sampling from scratch at any time.

I'm curious what the problem is with relying on a start-up plus incremental method, when the method in the distinct-sampling paper doesn't degenerate: you can start when the table is still empty. Constructing an index requires an initial full scan plus incremental update; what's the diff?

> Unless, of course, we use indexes for sampling, which seems like a
> *really good* idea to me

"distinct-sampling" applies for indexes, too.

I started tracking the discussion of this a bit late. The smart method for this is in VLDB'92: Gennady Antoshenkov, "Random Sampling from Pseudo-ranked B+-trees". I don't think this is online anywhere, except if you have a DBLP membership. Does anybody else know better?

Antoshenkov was the brains behind some of the really cool stuff in DEC Rdb (what eventually became Oracle). Compressed bitmap indices, parallel competing query plans, and smart handling of keys with hyperbolic distributions.

--
Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Markus Schaber <[EMAIL PROTECTED]>: > Hi, Josh, > > Josh Berkus wrote: > > > Yes, actually. We need 3 different estimation methods: > > 1 for tables where we can sample a large % of pages (say, >= 0.1) > > 1 for tables where we sample a small % of pages but are "easily > estimated" > > 1 for tables which are not easily estimated by we can't afford to > sample a > > large % of pages. > > > > If we're doing sampling-based estimation, I really don't want > people to lose > > sight of the fact that page-based random sampling is much less > expensive than > > row-based random sampling. We should really be focusing on > methods which > > are page-based. Okay, although given the track record of page-based sampling for n-distinct, it's a bit like looking for your keys under the streetlight, rather than in the alley where you dropped them :-) How about applying the distinct-sampling filter on a small extra data stream to the stats collector? -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] COPY vs INSERT
> Steven Rosenstein <[EMAIL PROTECTED]> writes: > > My question is, are there any advantages, drawbacks, or outright > > restrictions to using multiple simultaneous COPY commands to load > data into > > the same table? Do you mean, multiple COPY commands (connections) being putline'd from the same thread (process)? I have indirect evidence that this may hurt. Two copy commands from different threads/processes are fine, and can help, if they alternate contention on some other resource (disk/CPU). I'm basing this on being at the third generation of a COPY implementation. The app loads about 1M objects/hour from 6 servers. Each object is split across four tables. The batch load opens four connections and firehoses records down each. A batch is 10K objects. COPY invokes all the same logic as INSERT on the server side (rowexclusive locking, transaction log, updating indexes, rules). The difference is that all the rows are inserted as a single transaction. This reduces the number of fsync's on the xlog, which may be a limiting factor for you. You'll want to crank WAL_BUFFERS and CHECKPOINT_SEGMENTS to match, though. One of my streams has 6K records; I run with WB=1000, CS=128. The downside I found with multiple clients inserting large blocks of rows was, that they serialized. I THINK that's because at some point they all needed to lock the same portions of the same indexes. I'm still working on how to avoid that, tuning the batch size and inserting into a "queue" table with fewer indexes. COPY (via putline) didn't do measurably better than INSERT until I batched 40 newline-separate rows into one putline call, which improved it 2-3:1. The suspect problem was stalling on the TCP stream; the driver was flushing small packets. This may or may not be relevant to you; depends on how much processing (waiting) your app does between posting of rows. In such a case, writing alternately to two TCP streams from the same process increases the likelihood of a stall. I've never tested that set-up; it would have been heading AWAY from the solution in my case. Hope that helps. -- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
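A sketch of the "queue table with fewer indexes" idea mentioned above (names invented): the staging table carries no indexes, so the loaders don't contend on index pages, and index maintenance on the real table happens in one pass per batch:

    CREATE TABLE load_queue AS SELECT * FROM target_table LIMIT 0;  -- same columns, no indexes
    COPY load_queue FROM STDIN;                                     -- firehose the batch in
    INSERT INTO target_table SELECT * FROM load_queue;              -- one pass of index updates
    TRUNCATE load_queue;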
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Josh Berkus : > > >Perhaps I can save you some time (yes, I have a degree in Math). If I > > >understand correctly, you're trying extrapolate from the correlation > > >between a tiny sample and a larger sample. Introducing the tiny sample > > >into any decision can only produce a less accurate result than just > > >taking the larger sample on its own; GIGO. Whether they are consistent > > >with one another has no relationship to whether the larger sample > > >correlates with the whole population. You can think of the tiny sample > > >like "anecdotal" evidence for wonderdrugs. > > Actually, it's more to characterize how large of a sample we need. For > example, if we sample 0.005 of disk pages, and get an estimate, and then > sample another 0.005 of disk pages and get an estimate which is not even > close to the first estimate, then we have an idea that this is a table which > defies analysis based on small samples. Wheras if the two estimates are < > 1.0 stdev apart, we can have good confidence that the table is easily > estimated. Note that this doesn't require progressively larger samples; any > two samples would work. We're sort of wandering away from the area where words are a good way to describe the problem. Lacking a common scratchpad to work with, could I suggest you talk to someone you consider has a background in stats, and have them draw for you why this doesn't work? About all you can get out of it is, if the two samples are disjunct by a stddev, yes, you've demonstrated that the union of the two populations has a larger stddev than either of them; but your two stddevs are less info than the stddev of the whole. Breaking your sample into two (or three, or four, ...) arbitrary pieces and looking at their stddevs just doesn't tell you any more than what you start with. -- "Dreams come true, not free." -- S.Sondheim, ITW ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Suggestions for a data-warehouse migration routine
Quoting Richard Rowell <[EMAIL PROTECTED]>:
> I've ported enough of my companies database to Postgres to make
> warehousing on PG a real possibility. I thought I would toss my data
> migration architecture ideas out for the list to shoot apart..
> [...]

Not much feedback required. Yes, dropping the entire database is faster and simpler. If your database is small enough that you can rebuild it from scratch every time, go for it.

Yes, vacuum analyze required; creating indexes alone does not create statistics.

I'd also dump an extract of pg_stat[io_]user_(tables|indexes) to see how index usage and table load changes over time.

--
"Dreams come true, not free." -- S.Sondheim, ITW

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks suggested?
Quoting Andrew Dunstan <[EMAIL PROTECTED]>: > After some more experimentation, I'm wondering about some sort of > adaptive algorithm, a bit along the lines suggested by Marko Ristola, but limited to 2 rounds. > > The idea would be that we take a sample (either of fixed size, or > some small proportion of the table) , see how well it fits a larger sample > > (say a few times the size of the first sample), and then adjust the > formula accordingly to project from the larger sample the estimate for the full population. Math not worked out yet - I think we want to ensure that the result remains bounded by [d,N]. Perhaps I can save you some time (yes, I have a degree in Math). If I understand correctly, you're trying extrapolate from the correlation between a tiny sample and a larger sample. Introducing the tiny sample into any decision can only produce a less accurate result than just taking the larger sample on its own; GIGO. Whether they are consistent with one another has no relationship to whether the larger sample correlates with the whole population. You can think of the tiny sample like "anecdotal" evidence for wonderdrugs. -- "Dreams come true, not free." -- S.Sondheim, ITW ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Joel's Performance Issues WAS : Opteron vs Xeon
Quoting Alvaro Herrera <[EMAIL PROTECTED]>: > One further question is: is this really a meaningful test? I mean, in > production are you going to query 30 rows regularly? And is the > system always going to be used by only one user? I guess the question > is if this big select is representative of the load you expect in > production. While there may be some far-out queries that nobody would try, you might be surprised what becomes the norm for queries, as soon as the engine feasibly supports them. SQL is used for warehouse and olap apps, as a data queue, and as the co-ordinator or bridge for (non-SQL) replication apps. In all of these, you see large updates, large result sets and volatile tables ("large" to me means over 20% of a table and over 1M rows). To answer your specific question: yes, every 30 mins, in a data redistribution app that makes a 1M-row query, and writes ~1000 individual update files, of overlapping sets of rows. It's the kind of operation SQL doesn't do well, so you have to rely on one big query to get the data out. My 2c -- "Dreams come true, not free." -- S.Sondheim, ITW ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] Index bloat problem?
Quoting Bill Chandler <[EMAIL PROTECTED]>: > ... The normal activity is to delete 3-5% of the rows per day, > followed by a VACUUM ANALYZE. ... > However, on occasion, deleting 75% of rows is a > legitimate action for the client to take. > > In case nobody else has asked: is your max_fsm_pages > > big enough to handle all the deleted pages, > > across ALL tables hit by the purge? > This parameter is most likely set incorrectly. So > that could be causing problems. Could that be a > culprit for the index bloat, though? Look at the last few lines of vacuum verbose output. It will say something like: free space map: 55 relations, 88416 pages stored; 89184 total pages needed Allocated FSM size: 1000 relations + 100 pages = 5920 kB shared memory. "100" here is [max_fsm_pages] from my postgresql.conf. If the "total pages needed" is bigger than the pages fsm is allocated for, then you are bleeding. -- "Dreams come true, not free." -- S.Sondheim, ITW ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Slow copy with little CPU/disk usage
Quoting Tom Lane <[EMAIL PROTECTED]>: > "Jim C. Nasby" <[EMAIL PROTECTED]> writes: > > A friend of mine has an application where he's copying in 4000 rows at a > > time into a table that has about 4M rows. Each row is 40-50 bytes. This > > is taking 25 seconds on a dual PIII-1GHz with 1G of RAM and a 2 disk > > SATA mirror, running FBSD 4.10-stable. There's one index on the table. > > If there's no hidden costs such as foreign key checks, that does seem > pretty dang slow. > > > What's really odd is that neither the CPU or the disk are being > > hammered. The box appears to be pretty idle; the postgresql proces is > > using 4-5% CPU. -- This sounds EXACTLY like my problem, if you make the box to a Xeon 2.4GHz, 2GB RAM ... with two SCSI drives (xlog and base); loading 10K rows of about 200 bytes each; takes about 20 secs at the best, and much longer at the worst. By any chance does your friend have several client machines/processes trying to mass-load rows at the same time? Or at least some other processes updating that table in a bulkish way? What I get is low diskio, low cpu, even low context-switches ... and I'm betting he should take a look at pg_locks. For my own problem, I gather that an exclusive lock is necessary while updating indexes and heap, and the multiple processes doing the update can make that pathological. Anyway, have your friend check pg_locks. "Dreams come true, not free." -- S.Sondheim, ITW ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
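A quick way to check pg_locks for waiters, using only the columns that exist in 7.4/8.0 (ungranted lock requests sort first):

    SELECT c.relname, l.pid, l.mode, l.granted
    FROM   pg_locks l
    LEFT   JOIN pg_class c ON c.oid = l.relation
    ORDER  BY l.granted, c.relname;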
[PERFORM] Strange serialization problem
I have a performance problem; I'd like any suggestions on where to continue investigation. A set of insert-only processes seems to serialize itself. :-( The processes appear to be blocked on disk IO, and probably the table drive, rather than the pg_xlog drive. Each process is inserting a block of 10K rows into a table. I'm guessing they are "serialized" because one process by itself takes 15-20 secs; running ten processes in parallel averages 100-150 secs (each), with elapsed (wall) time of 150-200 secs. Polling pg_locks shows each process has (been granted) only the locks you would expect. I RARELY see an Exclusive lock on an index, and then only on one index at a time. A sample from pg_locks: TABLE/INDEX GRANTED PID MODE m_reason t 7340 AccessShare messaget 7340 AccessShare messaget 7340 RowExclusive pk_message t 7340 AccessShare tmp_messaget 7340 AccessShare ("m_reason" is a one-row lookup table; see INSERT cmd below). -- The query plan is quite reasonable (see below). On a side note, this is the first app I've had to deal with that is sweet to pg_xlog, but hammers the drive bearing the base table (3x the traffic). "log_executor_stats" for a sample insert look reasonable (except the "elapsed"!) ! system usage stats: ! 308.591728 elapsed 3.48 user 1.27 system sec ! [4.00 user 1.39 sys total] ! 0/0 [0/0] filesystem blocks in/out ! 18212/15 [19002/418] page faults/reclaims, 0 [0] swaps ! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent ! 0/0 [0/0] voluntary/involuntary context switches ! buffer usage stats: ! Shared blocks: 9675 read, 8781 written, buffer hit rate = 97.66% ! Local blocks:504 read, 64 written, buffer hit rate = 0.00% ! Direct blocks: 0 read, 0 written Summarized "ps" output for the above backend process, sampled every 5 secs, shows it is 94% in the 'D' state, 3% in the 'S' state. == BACKGROUND == **SOFTWARE - PG 7.4.6, RedHat 8. -- **HARDWARE Xeon 2x2 2.4GHz 2GB RAM 4 x 73GB SCSI; pg_xlog and base on separate drives. -- **APPLICATION Six machines post batches of 10K messages to the PG db server. Machine #nn generates its ID keys as "nn001"::bigint etc. Each process runs: - "COPY tmp_message FROM STDIN" loads its own one-use TEMP table. - " INSERT INTO message SELECT tmp.* FROM tmp_message AS tmp JOIN m_reason ON m_reason.name = tmp.reason LEFT JOIN message USING (ID) WHERE message.ID is null (check required because crash recovery logic requires idempotent insert) "DROP TABLE tmp_message" --- call me paranoid, this is 7.4 The COPY step time is almost constant when #processes varies from 1 to 10. -- **POSTGRES pg_autovacuum is running with default parameters. Non-default GUC values: checkpoint_segments= 512 default_statistics_target = 200 effective_cache_size = 50 log_min_duration_statement = 1000 max_fsm_pages = 100 max_fsm_relations = 1000 random_page_cost = 1 shared_buffers = 1 sort_mem = 16384 stats_block_level = true stats_command_string = true stats_row_level= true vacuum_mem = 65536 wal_buffers= 2000 Wal_buffers and checkpoint_segments look outrageous, but were tuned for another process, that posts batches of 1 6KB rows in a single insert. -- TABLE/INDEX STATISTICS -- MACHINE STATISTICS ps gives the backend process as >98% in (D) state, with <1% CPU. 
A "top" snapshot: CPU states: cpuusernice systemirq softirq iowaitidle total2.0%0.0%0.8% 0.0% 0.0% 96.9%0.0% cpu002.5%0.0%1.9% 0.0% 0.0% 95.4%0.0% cpu011.7%0.0%0.1% 0.0% 0.3% 97.6%0.0% cpu020.5%0.0%0.7% 0.0% 0.0% 98.6%0.0% cpu033.1%0.0%0.5% 0.0% 0.0% 96.2%0.0% Mem: 2061552k av, 2041752k used, 19800k free, 0k shrd, 21020k buff iostat reports that the $PGDATA/base drive is being worked but not overworked. The pg_xlog drive is underworked: KBPS TPS KBPS TPS KBPS TPS KBPS TPS 12:30 1 276316 31 8 3336 269 12:40 5 3 115122 5 5 2705 320 ^pg_xlog^ ^base^ The base drive has run as much as 10MBPS, 5K TPS. -- EXPLAIN ANALYZE output: The plan is eminently reasonable. But there's no visible relationship between the top
Re: [PERFORM] multi-line copy (was: Re: COPY Hacks)
Quoting Greg Sabino Mullane <[EMAIL PROTECTED]>: > > Using COPY ... FROM STDIN via the Perl DBI (DBD::Pg) interface, > > I accidentally strung together several \n-terminated input lines, > > and sent them to the server with a single "putline". > ... > > So. Is it a feechur? Worth stress-testing? Could be VERY cool. > > As explained elsewhere, not really a feature, more of a side-effect. > Keep in mind, however, that any network round-trip time saved has to > be balanced against some additional overhead of constructing the > combined strings in Perl before sending them over. Most times COPY > is used to parse a newline-separated file anyway. If you have a slow > network connection to the database, it *might* be a win, but my > limited testing shows that it is not an advantage for a "normal" > connection: I added 1 million rows via COPY using the normal way > (1 million pg_putline calls), via pg_putline of 1000 rows at a > time, and via 10,000 rows at a time. They all ran in 22 seconds, > with no statistical difference between them. (This was the "real" time, > the system time was actually much lower for the combined calls). > > It can't hurt to test things out on your particular system and see > if it makes a real difference: it certainly does no harm as long as > you make sure the string you send always *end* in a newline. Many thanks for digging into it. For the app I'm working with, the time delay between rows being posted is /just/ enough to exceed the TCP Nagle delay, so every row goes across in its own packet :-( Reducing the number of network roundtrips by a factor of 40 is enough to cut elapsed time in half. The cost of join("",@FortyRows), which produces a 1-4K string, is what's negligible in this case. -- "Dreams come true, not free" -- S.Sondheim, ITW ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] Recognizing range constraints (was Re: Plan for relatively simple query seems to be very inefficient)
Quoting Tom Lane <[EMAIL PROTECTED]>: > Yeah, the whole thing is only a heuristic anyway. I've been coming > around to the view that relation membership shouldn't matter, because > of cases like > > WHERE a.x > b.y AND a.x < 42 > > which surely should be taken as a range constraint. Out of curiosity, will the planner induce "b.y < 42" out of this? -- "Dreams come true, not free." ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
COPY Hacks (WAS: RE: [PERFORM] Postgresql vs SQLserver for this application ?)
This thread seems to be focusing in on COPY efficiency, I'd like to ask something I got no answer to, a few months ago. Using COPY ... FROM STDIN via the Perl DBI (DBD::Pg) interface, I accidentally strung together several \n-terminated input lines, and sent them to the server with a single "putline". To my (happy) surprise, I ended up with exactly that number of rows in the target table. Is this a bug? Is this fundamental to the protocol? Since it hasn't been documented (but then, "endcopy" isn't documented), I've been shy of investing in perf testing such mass copy calls. But, if it DOES work, it should be reducing the number of network roundtrips. So. Is it a feechur? Worth stress-testing? Could be VERY cool. -- "Dreams come true, not free." ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Plan for relatively simple query seems to be very inefficient
Quoting Arjen van der Meijden <[EMAIL PROTECTED]>: > Hi list, > > I noticed on a forum a query taking a surprisingly large amount of time > in MySQL. Of course I wanted to prove PostgreSQL 8.0.1 could do it much > better. To my surprise PostgreSQL was ten times worse on the same > machine! And I don't understand why. > > I don't really need this query to be fast since I don't use it, but the > range-thing is not really an uncommon query I suppose. So I'm wondering > why it is so slow and this may point to a wrong plan being chosen or > generated. > > Here are table definitions: > > Table "public.postcodes" > Column| Type | Modifiers > -+---+--- > postcode_id | smallint | not null > range_from | smallint | > range_till | smallint | > Indexes: > "postcodes_pkey" PRIMARY KEY, btree (postcode_id) > "range" UNIQUE, btree (range_from, range_till) > > Table "public.data_main" > Column | Type | Modifiers > +--+--- > userid | integer | not null > range | smallint | > Indexes: > "data_main_pkey" PRIMARY KEY, btree (userid) > > And here's the query I ran: > > SELECT COUNT(*) FROM > data_main AS dm, > postcodes AS p > WHERE dm.range BETWEEN p.range_from AND p.range_till I just posted an answer to this (via webcafe webmail; can't recall which pg-list), that might interest you. BTree indexes as they stand (multi-column, ...) answer what most people need for queries. Unfortunately, out-of-the-box, they have no good way of handling range queries. To compensate, you can use a small amount of kinky SQL. This is in the same line as the tricks used to implement hierarchic queries in relational SQL. [1] Create a table "widths"(wid int) of powers of 2, up to what will just cover max(range_till-range_from). Since your "range" column is a smallint, this table can have no more than 15 rows. You can get as fussy as you want about keeping this table to a minimum. [2] Change postcodes: ALTER TABLE postcodes ADD wid INT USING 2 ^ CEIL(LOG(range_from - range_till,2)); ALTER TABLE postcodes ADD start INT USING range_from - (range_from % wid); CREATE INDEX postcodes_wid_start_index ON (wid, start); ANALYZE postcodes; [4] Write your query as: SELECT COUNT(*) FROM data_main AS dm CROSS JOIN widths -- yes, CROSS JOIN. For once, it HELPS performance. JOIN postcodes AS p ON dm.wid = widths.wid AND dm.start = p.range - p.range % widths.wid WHERE dm.range BETWEEN p.range_from AND p.range_till This uses BTREE exact-match to make a tight restriction on which rows to check. YMMV, but this has worked even for multi-M table joins. -- "Dreams come true, not free." ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
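The ALTER TABLE ... USING lines above are shorthand, and in the final query the derived columns belong on the postcodes side of the join, not on data_main. Spelled out for stock PostgreSQL, the idea might look like the sketch below. One deliberate deviation from the formula above: wid is taken as the smallest power of two for which range_from and range_till land in the same aligned bucket, so a range that straddles a bucket boundary still maps to a single (wid, start) slot.

-- powers of two; 2^15 is more than enough for smallint ranges
CREATE TABLE widths (wid int PRIMARY KEY);
INSERT INTO widths SELECT (2 ^ g)::int FROM generate_series(0, 15) AS g;

ALTER TABLE postcodes ADD COLUMN wid int;
ALTER TABLE postcodes ADD COLUMN start int;

-- smallest power of two whose aligned bucket contains the whole range
UPDATE postcodes
   SET wid = (SELECT min(w.wid) FROM widths w
              WHERE postcodes.range_from / w.wid = postcodes.range_till / w.wid);
UPDATE postcodes SET start = range_from - (range_from % wid);

-- optional: keep only the widths actually used, to shrink the cross join
DELETE FROM widths WHERE wid NOT IN (SELECT DISTINCT wid FROM postcodes);

CREATE INDEX postcodes_wid_start_index ON postcodes (wid, start);
ANALYZE postcodes;

-- the range test becomes an exact-match index probe per candidate width,
-- with the BETWEEN kept as a final recheck
SELECT COUNT(*)
FROM data_main AS dm
CROSS JOIN widths
JOIN postcodes AS p
  ON  p.wid   = widths.wid
  AND p.start = dm.range - (dm.range % widths.wid)
WHERE dm.range BETWEEN p.range_from AND p.range_till;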
Re: [PERFORM] date - range
Quoting Mischa <[EMAIL PROTECTED]>:
[deleted]
> SELECT People.* FROM People
> JOIN Widths
> ON People.start = today - today % Widths.width
> AND People.width = Widths.width
Yikes! I hit the SEND button one ohnosecond too fast.
(1) You still ALSO have to test:
... AND today between first_date and last_date
(2) On some SQL engines, it makes a difference to how the engine can re-order the nested loops, if you make the index (width,start) instead of (start,width). Haven't tried on PG8 yet.
-- "Dreams come true, not free."
---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [PERFORM] date - range
Quoting "H.J. Sanders" <[EMAIL PROTECTED]>: > > Anybody a solution for the next problem: > people can subscribe to a service for 1 or more days (upto a max. of 365). > So in the database is stored: first_date and last_date > To select which people are subscribed for a certain date (e.g. today) we use > a select like > > select ... where first_date <= today and last_date >= today > > Whatever index we create system always does a sequential scan (which I can > understand). Has someone a smarter solution? Yep, standard SQL problem. The answer is sort of a hand-rolled GIST index. To save typing, I'm going to pretend all your dates are stored as integers. In reality, you'll probably be writing views with lots of EXTRACT(EPOCH...)'s in them, to achieve the same result. Suppose you have table People(id, first_date, last_date, ...) Each such range "fits" in some larger fixed range of 1,2,4, ... days that starts and ends on a fixed (epoch) date multiple of 1,2,4,... For example, if your range were days (1040..1080), then that fits in the 64-wide range (1024...1088]. You calculate the start and width of the range that just fits, and store that in People, too. Now, you index on (start,width). Now, when you want to query for a given "today", you have to try for all possible widths in People. Fortunately, that's darn few! The ranges up to a decade (!) will still mean only 16 different widths. A max range of one year (<512 days) means only 9 widths. You can do this with a tiny static table. Then: the query: SELECT People.* FROM People JOIN Widths ONPeople.start = today - today % Widths.width AND People.width = Widths.width Though this may look gross, it makes an index work where no normal BTree index would. I've used it for some really nasty data conversions of 100M-row tables. Your first name wouldn't be "Harlan", would it? :-) -- "Dreams come true, not free." ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[PERFORM] Multi-line requests in COPY ... FROM STDIN
I'm using a 7.4.6 Perl app that bulk-loads a table, by executing a "COPY TMP_Message FROM STDIN", then letting $dbh->func($message_text."\n", "putline") Speculation made me try catenating Several \n-terminated lines together, and making a single putline() call with that. Lo and behold, all the lines went in as separate rows, as I hoped. I haven't measured the performance difference using this multiline batching. I'm hoping that there will be as much,since the app is really sucking on a 500 msg/sec firehose, and the db side needs serious speeding up. Question is, am I playing with a version-dependent anomaly, or should I expect this to continue in 8.x (until, eventually, silently, something causes this to break)? I'm presuming that this is not a Perl DBI/DBD::Pg question, but rather, depending on the underlying pq lib and fe protocol. -- "Dreams come true, not free." ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] bad performances using hashjoin
Quoting Tom Lane <[EMAIL PROTECTED]>:
> Klint Gore <[EMAIL PROTECTED]> writes:
> > Is having an order by in a view legal?
>
> Not according to the SQL spec, but PG has allowed it for several releases.
> (The same goes for ORDER BY in a sub-select, which is actually pretty
> much the same thing ...)
Umm... if you implement LIMIT in a subselect, the ORDER BY becomes highly meaningful (and useful). Is this a case of one nonstandard feature being the thin edge of the wedge for another?
-- "Dreams come true, not free."
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
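For instance (table and column names purely for illustration), the inner ORDER BY is exactly what gives the LIMIT a defined meaning:

SELECT *
FROM (SELECT id, score
      FROM results
      ORDER BY score DESC
      LIMIT 10) AS top10;   -- "top ten" only makes sense because of the inner ORDER BY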
Re: [PERFORM]
No support for partitioned tables? Perhaps in name ... but I use a time-based "partition" tables that inherit from a base table; new partitions are "placed" (moved) round-robin on a set of drives. Somewhat manual, but if you really need a solution now, it works. Quoting Greg Stark <[EMAIL PROTECTED]>: > > "Matt Casters" <[EMAIL PROTECTED]> writes: > > > I've been reading up on partitioned tabes on pgsql, will the performance > > benefit will be comparable to Oracle partitioned tables? > > Postgres doesn't have any built-in support for partitioned tables. You can > do > it the same way people did it on Oracle up until 8.0 which is by creating > views of UNIONs or using inherited tables. > > The main advantage of partitioned tables is being able to load and drop data > in large chunks instantaneously. This avoids having to perform large deletes > and then having to vacuum huge tables to recover the space. > > However in Postgres you aren't going to get most of the performance > advantage > of partitions in your query plans. The Oracle planner can prune partitions > it > knows aren't relevant to the query to avoid having to search through them. > > This can let it get the speed of a full table scan without the disadvantage > of > having to read irrelevant tuples. Postgres is sometimes going to be forced > to > either do a much slower index scan or read tables that aren't relevant. > > -- > greg > > > ---(end of broadcast)--- > TIP 7: don't forget to increase your free space map settings > -- "Dreams come true, not free." ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
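A minimal sketch of that manual arrangement (names are illustrative, 8.0-era syntax since that is where tablespaces first appeared, and it assumes tablespaces disk1/disk2 already exist on the target drives):

CREATE TABLE log_base (
    logged_at  timestamptz NOT NULL,
    payload    text
);

-- one child per period, each placed on its own drive via a tablespace
CREATE TABLE log_2005_01 () INHERITS (log_base) TABLESPACE disk1;
CREATE TABLE log_2005_02 () INHERITS (log_base) TABLESPACE disk2;

-- loads go straight into the child for that period
COPY log_2005_02 FROM '/tmp/feb.dat';

-- queries against the parent see every child; note that the planner of this
-- era scans all children -- there is no partition pruning, as Greg says
SELECT count(*) FROM log_base WHERE logged_at >= '2005-02-01';

Dropping a whole period is then just DROP TABLE log_2005_01, which is the instant bulk-unload Greg describes.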
Re: [PERFORM] Normal case or bad query plan?
This may sound more elaborate than it's worth, but I don't know of a better way to avoid a table scan. You want to index on a computed value that is a common prefix of your FROM and TO fields. The next step is to search on a fixed SET of prefixes of different lengths. For example, some of your ranges might be common in the first 3 bytes of ipaddr, some in two, some in only one. You create and index on one common prefix of either 1,2 or 3 bytes, for each row. Your query then looks something like (pardon my ignorance in PGSQL) select * fromip2location where ip2prefix in ( network(:myaddr || '/8'), network(:myaddr || '/16'), network(:myaddr || '/24'), :myaddr --- assuming single-address ranges are possible ) and :myaddr between ip_address_from and ip_address_to Although this looks a little gross, it hits very few records. It also adapts cleanly to a join between ip2location and a table of ip addrs. Gabriele Bartolini wrote: Hi guys, please consider this scenario. I have this table: CREATE TABLE ip2location ( ip_address_from BIGINT NOT NULL, ip_address_to BIGINT NOT NULL, id_location BIGINT NOT NULL, PRIMARY KEY (ip_address_from, ip_address_to) ); I created a cluster on its primary key, by running: CLUSTER ip2location_ip_address_from_key ON ip2location; This allowed me to organise data in a more efficient way: the data that is contained are ranges of IP addresses with empty intersections; for every IP class there is a related location's ID. The total number of entries is 1392443. For every IP address I have, an application retrieves the corresponding location's id from the above table, by running a query like: SELECT id_location FROM ip2location WHERE '1102111' >= ip_address_from AND '1102111' <= ip_address_to; For instance, by running the 'EXPLAIN ANALYSE' command, I get this "funny" result: QUERY PLAN - Seq Scan on ip2location (cost=0.00..30490.65 rows=124781 width=8) (actual time=5338.120..40237.283 rows=1 loops=1) Filter: ((1040878301::bigint >= ip_address_from) AND (1040878301::bigint <= ip_address_to)) Total runtime: 40237.424 ms With other data, that returns an empty set, I get: explain SELECT id_location FROM ip2location WHERE '1102111' >= ip_address_from AND '1102111' <= ip_address_to; QUERY PLAN --- Index Scan using ip2location_ip_address_from_key on ip2location (cost=0.00..419.16 rows=140 width=8) Index Cond: ((1102111::bigint >= ip_address_from) AND (1102111::bigint <= ip_address_to)) I guess the planner chooses the best of the available options for the first case, the sequential scan. This is not confirmed though by the fact that, after I ran "SET enable_scan TO off", I got this: QUERY PLAN Index Scan using ip2location_ip_address_from_key on ip2location (cost=0.00..31505.73 rows=124781 width=8) (actual time=2780.172..2780.185 rows=1 loops=1) Index Cond: ((1040878301::bigint >= ip_address_from) AND (1040878301::bigint <= ip_address_to)) Total runtime: 2780.359 ms Is this a normal case or should I worry? What am I missing? Do you have any suggestion or comment to do (that would be extremely appreciated)? Is the CLUSTER I created worthwhile or not? Thank you, -Gabriele -- Gabriele Bartolini: Web Programmer, ht://Dig & IWA/HWG Member, ht://Check maintainer Current Location: Prato, Toscana, Italia [EMAIL PROTECTED] | http://www.prato.linux.it/~gbartolini | ICQ#129221447 > "Leave every hope, ye who enter!", Dante Alighieri, Divine Comedy, The Inferno --- Outgoing mail is certified Virus Free. Checked by AVG anti-virus system (http://www.grisoft.com). 
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
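To make the prefix idea above concrete, a sketch only: it assumes an added cidr column ip2prefix holding each row's common prefix, and that the address being looked up is supplied both as an inet parameter (:myaddr) and as the bigint form used by the existing columns (:myaddr_bigint). All names are illustrative.

-- index the common prefix once...
CREATE INDEX ip2location_prefix ON ip2location (ip2prefix);

-- ...then each lookup probes a handful of exact prefixes instead of scanning ranges
SELECT id_location
FROM ip2location
WHERE ip2prefix IN (
        network(set_masklen(:myaddr, 8)),
        network(set_masklen(:myaddr, 16)),
        network(set_masklen(:myaddr, 24)),
        network(set_masklen(:myaddr, 32))   -- the single-address case
      )
  AND :myaddr_bigint BETWEEN ip_address_from AND ip_address_to;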
Re: [PERFORM] sequential scan on select distinct
Tom Lane wrote: Ole Langbehn <[EMAIL PROTECTED]> writes: What do you think about the idea of an "UniqueSort" which would do sort+unique in one pass ? This is what oracle does and it is quite fast with it... Hashing is at least as fast, if not faster. regards, tom lane I got good mileage in a different SQL engine, by combining the hash-aggregate and sort nodes into a single operator. The hash table was just an index into the equivalent of the heap used for generating runs. That gave me partially aggregated data, or eliminated duplicate keys, without extra memory overhead of the hash-aggregation node below the sort. Memory was scarce then ... :-) BTW I'm really puzzled that Oracle is pushing 'index skip scan' as a new feature. Wasn't this in the original Oracle Rdb --- one of Gennady Antoshenkov's tweaks? ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
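For what it's worth, on the PG side the hashed path Tom mentions is already reachable by writing the DISTINCT as a GROUP BY (names illustrative; in releases of this era SELECT DISTINCT itself always sorts, while GROUP BY may hash given enough sort/work memory):

EXPLAIN SELECT some_col FROM big_table GROUP BY some_col;   -- may show HashAggregate, no Sort
EXPLAIN SELECT DISTINCT some_col FROM big_table;            -- Sort + Unique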
[PERFORM] Tryint to match Solaris-Oracle performance with directio?
Our product (Sophos PureMessage) runs on a Postgres database. Some of our Solaris customers have Oracle licenses, and they've commented on the performance difference between Oracle and Postgresql on such boxes. In-house, we've noticed the 2:1 (sometimes 5:1) performance difference in inserting rows (mostly 2-4K), between Postgresql on Solaris 8 and on Linux, for machines with comparable CPU's and RAM. These (big) customers are starting to ask, why don't we just port our dataserver to Oracle for them? I'd like to avoid that, if possible :-) What we can test on, in-house are leetle Sun workstations, while some of our customers have BIG Sun iron --- so I have no means to-date to reproduce what their bottleneck is :-( Yes, it has been recommended that we talk to Sun about their iForce test lab ... that's in the pipe. In the meantime, what I gather from browsing mail archives is that postgresql on Solaris seems to get hung up on IO rather than CPU. Furthermore, I notice that Oracle and now MySQL use directio to bypass the system cache, when doing heavy writes to the disk; and Postgresql does not. Not wishing to alter backend/store/file for this test, I figured I could get a customer to mount the UFS volume for pg_xlog with the option "forcedirectio". Any comment on this? No consideration of what the wal_sync_method is at this point. Presumably it's defaulting to fdatasync on Solaris. BTW this is Postgres 7.4.1, and our customers are Solaris 8 and 9. ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres --
Googling 'upsert' (an Oraclism, I believe) will get you hits on Oracle and DB2's implementation of MERGE, which does what AMOUNTS to what is described below (one mass UPDATE...FROM, one mass INSERT...WHERE NOT EXISTS). No, you shouldn't iterate row-by-row through the temp table. Whenever possible, try to do updates in one single (mass) operation. Doing it that way gives the optimizer the best chance at amortizing fixed costs, and batching operations. - In any database other than Postgres, I would recommend doing the INSERT /followed by/ the UPDATE. That order looks wonky --- your update ends up pointlessly operating on the rows just INSERTED. The trick is, UPDATE acquires and holds write locks (the rows were previously visible to other processes), while INSERT's write locks refer to rows that no other process could try to lock. Stephen Frost wrote: * Markus Schaber ([EMAIL PROTECTED]) wrote: Generally, what is the fastest way for doing bulk processing of update-if-primary-key-matches-and-insert-otherwise operations? This is a very good question, and I havn't seen much of an answer to it yet. I'm curious about the answer myself, actually. In the more recent SQL specs, from what I understand, this is essentially what the 'MERGE' command is for. This was recently added and unfortunately is not yet supported in Postgres. Hopefully it will be added soon. Otherwise, what I've done is basically an update followed by an insert using outer joins. If there's something better, I'd love to hear about it. The statements looks something like: update X set colA = a.colA, colB = a.colB from Y a where keyA = a.keyA and keyB = a.keyB; insert into X select a.keyA, a.keyB, a.colA, a.colB from Y a left join X b using (keyA, keyB) where b.keyA is NULL and b.keyB is NULL; With the appropriate indexes, this is pretty fast but I think a merge would be much faster. Thanks, Stephen ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
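Using Stephen's X/Y tables, the insert-first ordering described above would look roughly like this (a sketch; whether it wins depends on how much the UPDATE's row locks actually contend with other sessions):

BEGIN;
-- new keys first: these rows are invisible to other transactions,
-- so the locks they take block nobody
INSERT INTO X (keyA, keyB, colA, colB)
SELECT a.keyA, a.keyB, a.colA, a.colB
FROM Y a
LEFT JOIN X b USING (keyA, keyB)
WHERE b.keyA IS NULL;

-- then the mass update; its write locks on pre-existing rows are acquired
-- late, so they are held for less of the transaction. Note it also rewrites
-- the rows just inserted, which is the wasted work the text above mentions.
UPDATE X
SET colA = a.colA, colB = a.colB
FROM Y a
WHERE X.keyA = a.keyA AND X.keyB = a.keyB;
COMMIT;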
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables
Simon Riggs wrote: Jim C. Nasby On Mon, Sep 13, 2004 at 11:07:35PM +0100, Simon Riggs wrote: PostgreSQL's functionality is in many ways similar to Oracle Partitioning. Loading up your data in many similar tables, then creating a view like: CREATE VIEW BIGTABLE (idate, col1, col2, col3...) AS SELECT 200409130800, col1, col2, col3... FROM table200409130800 UNION ALL SELECT 200409131000, col1, col2, col3... FROM table200409131000 UNION ALL SELECT 200409131200, col1, col2, col3... FROM table200409131200 ...etc... will allow the PostgreSQL optimizer to eliminate partitions from the query when you run queries which include a predicate on the partitioning_col, e.g. select count(*) from bigtable where idate >= 200409131000 The "partitions" are just tables, so no need for other management tools. Oracle treats the partitions as sub-tables, so you need a range of commands to add, swap etc the partitions of the main table. A few years ago I wrote a federated query engine (wrapped as an ODBC driver) that had to handle thousands of contributors (partitions) to a pseudotable / VIEWofUNIONs. Joins did require some special handling in the optimizer, because of the huge number of crossproducts between different tables. It was definitely worth the effort at the time, because you need different strategies for: joining a partition to another partition on the same subserver; joining two large partitions on different servers; and joining a large partition on one server to a small one on another. The differences may not be so great for a solitary server; but they're still there, because of disparity in subtable sizes. The simplistic query plans tend to let you down, when you're dealing with honking warehouses. I'm guessing that Oracle keeps per-subtable AND cross-all-subtables statistics, rather than building the latter from scratch in the course of evaluating the query plan. That's the one limitation I see in emulating their partitioned tables with Views. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Help with extracting large volumes of records across related
Damien Dougan wrote: Basically we have a number of tables, which are exposed as 2 public views (say PvA and PvB). For each row in PvA, there are a number of related rows in PvB (this number is arbitrary, which is one of the reasons why it cant be expressed as additional columns in PvA - so we really need 2 sets of tables - which leads to two sets of extract calls - interwoven to associate PvA with PvB). Are there any tools/tricks/tips with regards to extracting large volumes of data across related tables from Postgres? It doesnt have to export into XML, we can do post-processing on the extracted data as needed - the important thing is to keep the relationship between PvA and PvB on a row-by-row basis. Just recently had to come up with an alternative to MSSQL's "SQL..FOR XML", for some five-level nested docs, that turned out to be faster (!) and easier to understand: Use SQL to organize each of the row types into a single text field, plus a single key field, as well as any filter fields you . Sort the union, and have the reading process break them into documents. For example, if PvA has key (account_id, order_id) and fields(order_date, ship_date) and PvB has key (order_id, product_id) and fields (order_qty, back_order) CREATE VIEW PvABxml AS SELECT account_id::text + order_id::text AS quay ,'order_date="' + order_date::text + '" ship_date="' + ship_date::text + '"' AS info ,ship_date FROMPvA UNION ALL SELECT account_id::text + order_id::text + product_id::text ,'order_qty="' + order_qty::text +'"' ,ship_date FROMPvA JOIN PvB USING (order_id) Then: SELECT quay, info FROM pvABxml WHERE ship_date = '...' ORDER BY quay gives you a stream of info in the (parent,child,child... parent,child,child...) order you want, that assemble very easily into XML documents. If you need to pick out, say, orders where there are backordered items, you probably need to work with a temp table with which to prefilter. ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
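The view above mixes MSSQL-style "+" string concatenation with PostgreSQL casts; in PostgreSQL the operator is ||, and putting a delimiter in the key avoids collisions between, say, account 1/order 23 and account 12/order 3. A version along those lines (column names as in the example, everything else illustrative):

CREATE VIEW pvabxml AS
SELECT account_id::text || '-' || order_id::text AS quay,
       'order_date="' || order_date::text
         || '" ship_date="' || ship_date::text || '"' AS info,
       ship_date
FROM   pva
UNION ALL
SELECT account_id::text || '-' || order_id::text || '-' || product_id::text,
       'order_qty="' || order_qty::text || '"',
       ship_date
FROM   pva JOIN pvb USING (order_id);

-- one sorted pass gives each parent row immediately followed by its children
SELECT quay, info
FROM pvabxml
WHERE ship_date = DATE '2004-09-01'   -- illustrative filter
ORDER BY quay;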
Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables
Mark Cotner wrote: Hi all, I had a difficult time deciding which list to post this to, so please forgive me if this list doesn't perfectly match my questions. My decision will not solely be based on performance, but it is the primary concern. I would be very appreciative if you all could comment on my test plan. Real world examples of a Postgres implementation of >=600G with a web front-end would be great, or any data warehouse with some size to it. I'm only in the 30GB range of database, in case that's a consideration for my comments that follow. At this time, I'm working out the best ROLAP options for our PG transaction store. The transaction store is highly volatile -- longest a txn stays in it is 15 days ... so you imagine the need for historic summaries :-) I've also implemented multiple data servers, including a federated server that had to make the best of existing engines (like MySQL, PG and everything from MSJet to RedBrick in the commercial world). The time has come to reevaluate/rearchitect an application which I built about 3 years ago. There are no performance concerns with MySQL, but it would benefit greatly from stored procedures, views, etc. If your company is currently happy with MySQL, there probably are other (nontechnical) reasons to stick with it. I'm impressed that you'd consider reconsidering PG. Some of the mining that we do could benefit from stored procedures as well. MySQL may have these in the works, but we won't be able to move to a version of MySQL that supports stored procs for another year or two. And PG lets you back-end with some powerful pattern- and aggregate-handling languages, like Perl. This was definitely a plus for data mining of web traffic, for example. The power of server-side extensibility for bailing you out of a design dead-end is not inconsequential. PG doesn't have PIVOT operators (qv Oracle and MSSQL), but it makes the translation from data to column fairly painless otherwise. Requirements: Merge table definition equivalent. We use these extensively. Looked all over mysql.com etc, and afaics merge table is indeed exactly a view of a union-all. Is that right? PG supports views, of course, as well (now) as tablespaces, allowing you to split tables/tablesets across multiple disk systems. PG is also pretty efficient in query plans on such views, where (say) you make one column a constant (identifier, sort of) per input table. Merge table equivalent with all tables containing over 100M rows(and about 40 columns, some quite wide) will need to do index scans in at least 5 seconds(MySQL currently does 2, but we can live with 5) and return ~200 rows. PG has TOAST for handling REALLY BIG columns, and the generic TEXT type is as efficient as any size-specific VARCHAR() type ... should make things easier for you. Um, gonna sound silly, but the web interface has to remain "snappy" under load. I don't see this as a major concern since you don't require table locking. Agreed. It's more in your warehouse design, and intelligent bounding of queries. I'd say PG's query analyzer is a few years ahead of MySQL for large and complex queries. If business logic is moved to the database(likely with Postgres) performance for inserting with light logic on each insert has to keep up with the 4.5M inserts per 2 hours(which MySQL completes in ~35min currently). Acceptable numbers for this aggregation would be 45-55min using stored procedures. Again, it's a matter of pipeline design. The tools for creating an efficient pipeline are at least as good in PG as MySQL. 
If you try to insert and postprocess information one row at a time, procedures or no, there's no offhand way to guarantee your performance without a test/prototype. On the other hand, if you do warehouse-style loading (Insert, or PG COPY, into a temp table; and then 'upsert' into the perm table), I can guarantee 2500 inserts/sec is no problem. Here's our case study if you're interested . . . http://www.mysql.com/customers/customer.php?id=16 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
Mischa Sandberg wrote: Coming from the MSSQL world, I'm used to the first step in optimization to be, choose your clustered index and choose it well. I see that PG has a one-shot CLUSTER command, but doesn't support continuously-updated clustered indexes. What I infer from newsgroup browsing is, such an index is impossible, given the MVCC versioning of records (happy to learn I'm wrong). I'd be curious to know what other people, who've crossed this same bridge from MSSQL or Oracle or Sybase to PG, have devised, faced with the same kind of desired performance gain for retrieving blocks of rows with the same partial key. Just to let people know, after trying various options, this looks the most promising: - segment the original table into four tables (call them A,B,C,D) - all insertions go into A. - longterm data lives in B. - primary keys of all requests to delete rows from (B) go into D -- no actual deletions are done against B. Deletions against A happen as normal. - all queries are made against a view: a union of A and B and (not exists) D. - daily merge A,B and (where not exists...) D, into C - run cluster on C, then swap names on B and C, truncate A and D. Not rocket science, but it seems to give the payback of normal clustering without locking the table for long periods of time. It also saves on VACUUM FULL time. At present, we're only at 1M rows in B on this. More when I know it. Advance warning on any gotchas with this approach would be much appreciated. Making a complete copy of (B) is a bit of an ouch. ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
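Spelled out, the rotation described above might look like the sketch below (illustrative names; note the view has to be re-created after the rename, because PG views bind to tables by OID rather than by name, and a repeating job would also want to rename or uniquely name the new table's index each cycle).

-- B holds long-term rows, A takes new inserts, D records keys "deleted" from B
CREATE TABLE b_longterm (id bigint PRIMARY KEY, payload text);
CREATE TABLE a_recent   (LIKE b_longterm);
CREATE TABLE d_deleted  (id bigint PRIMARY KEY);

CREATE VIEW big_union AS
    SELECT * FROM a_recent
    UNION ALL
    SELECT b.* FROM b_longterm b
    WHERE NOT EXISTS (SELECT 1 FROM d_deleted d WHERE d.id = b.id);

-- daily merge: build C from the view, cluster it, swap it in for B, reset A and D
CREATE TABLE c_merge AS SELECT * FROM big_union;
CREATE UNIQUE INDEX c_merge_id_idx ON c_merge (id);
CLUSTER c_merge_id_idx ON c_merge;   -- 8.0-era syntax; newer releases spell it CLUSTER c_merge USING c_merge_id_idx

BEGIN;
ALTER TABLE b_longterm RENAME TO b_old;
ALTER TABLE c_merge    RENAME TO b_longterm;
CREATE OR REPLACE VIEW big_union AS
    SELECT * FROM a_recent
    UNION ALL
    SELECT b.* FROM b_longterm b
    WHERE NOT EXISTS (SELECT 1 FROM d_deleted d WHERE d.id = b.id);
TRUNCATE a_recent;
TRUNCATE d_deleted;
COMMIT;
DROP TABLE b_old;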
Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
J. Andrew Rogers wrote: On Thu, 2004-08-26 at 12:30, Magnus Hagander wrote: IIRC, SQL Server always creates clustered indexes by default for primary keys. That would surprise me actually. Yaz, it should. It doesn't ALWAYS create clustered (unique) index for primary keys, but clustered is the default if you just specify CREATE TABLE Foo (col1, ... ,PRIMARY KEY(col1, ...) ) Saying PRIMARY KEY NONCLUSTERED(...) is how you override the default. ((Weird to be discussing so much MSSQL here)) ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
Sheer nitpick here... A B-tree is where the records (data) live at all levels of the tree; B+ tree is where the records are only at the leaf level. That's what Knuth calls them, anyway. Clustered indexes for all known dbs are true B+ trees. Nonclustered indexes could be B-trees (probably aren't), since there's no big fanout penalty for storing the little (heap) row locators everywhere at all levels. J. Andrew Rogers wrote: As far as I know, Oracle does it by having a B-Tree organized heap (a feature introduced around v8 IIRC), basically making the primary key index and the heap the same physical structure. ... ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
This discussion is starting to sound like the split in HEAP memory management evolution, into garbage-collecting (e.g. Java) and non-garbage-collecting (e.g. C++). Reclamation by GC's these days has become seriously sophisticated. CLUSTER resembles the first generation of GC's, which were single-big-pass hold-everything-else threads. Perhaps the latest in incremental GC algorithms would be worth scouting, for the next step in PG page management. Greg Stark wrote: Bruce Momjian <[EMAIL PROTECTED]> writes: but is there any significant performance benefit to doing that which would offset the compaction advantage? Just as a side comment. Setting PCTFREE 0 PCTUSED 100 on tables that have no updates on them has an astonishingly big effect on speed. So the penalty for leaving some space free really is substantial. I think the other poster is right. Oracle really needs pctfree because of the way it handles updates. Postgres doesn't really need as much because it doesn't try to squeeze the new tuple in the space the old one took up. If it doesn't fit on the page the worst that happens is it has to store it on some other page, whereas oracle has to do its strange row chaining thing. ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
I think you've probably fingered the kicker of why PG doesn't have this kind of clustering already. Hence perhaps the need for other approaches to the issue (the disk-IO efficiency of reading groups of rows related by a common key) that other DB's (with in-place update) address with synchronous clustering ('heap rebalancing' ?). Bruce Momjian wrote: Adi Alurkar wrote: IIRC it it to reduce the "overflow" of data or what oracle calls chained rows. i.e if a table has variable length columns and 10 rows get inserted into a datapage, if this datapage is full and one of the variable length field gets updated the row will now "overflow" into another datapage, but if the datapage is created with an appropriate amount of free space the updated row will be stored in one single datapage. Agreed. What I am wondering is with our system where every update gets a new row, how would this help us? I know we try to keep an update on the same row as the original, but is there any significant performance benefit to doing that which would offset the compaction advantage? ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [PERFORM] Equivalent praxis to CLUSTERED INDEX?
Ummm ... not quite. In MSSQL/Sybase/Oracle, a clustered index maintains its space saturation as part of each update operation. High activity does indeed result in less-full pages (typically 60-80% full for tables with heavy deletions or rowsize changes). To bring the percentage back up, you run DBCC INDEXDEFRAG, which also does what you'd expect of a normal file defragmenter -- put related disk pages together on the platter. But the performance difference is hardly as severe as I gather it can be if you neglect to vacuum. As for SQL Server being a 'single-user database' ... ummm ... no, I don't think so. I'm REALLY happy to be shut of the Microsoft world, but MSSQL 7/2000/2005 is a serious big DB engine. It also has some serious bright heads behind it. They hired Goetz Graefe and Paul (aka Per-Ake) Larsen away from academia, and it shows, in the join and aggregate processing. I'll be a happy camper if I manage to contribute something to PG that honks the way their stuff does. Happy to discuss, too. Josh Berkus wrote: Bruce, How do vendors actually implement auto-clustering? I assume they move rows around during quiet periods or have lots of empty space in each value bucket. That's how SQL Server does it. In old versions (6.5) you had to manually send commands to update the cluster, same as PG. Also, when you create a cluster (or an index or table for that matter) you can manually set an amount of "space" to be held open on each data page for updates. Also keep in mind that SQL Server, as a "single-user database" has a much easier time with this. They don't have to hold several versions of an index in memory and collapse it into a single version at commit time. All that being said, we could do a better job of "auto-balancing" clustered tables. I believe that someone was working on this in Hackers through what they called "B-Tree Tables". What happened to that? ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[PERFORM] Equivalent praxis to CLUSTERED INDEX?
Coming from the MSSQL world, I'm used to the first step in optimization to be, choose your clustered index and choose it well. I see that PG has a one-shot CLUSTER command, but doesn't support continuously-updated clustered indexes. What I infer from newsgroup browsing is, such an index is impossible, given the MVCC versioning of records (happy to learn I'm wrong). I'd be curious to know what other people, who've crossed this same bridge from MSSQL or Oracle or Sybase to PG, have devised, faced with the same kind of desired performance gain for retrieving blocks of rows with the same partial key. ---(end of broadcast)--- TIP 8: explain analyze is your friend
Re: [PERFORM] my boss want to migrate to ORACLE
Regarding Raid5 at all, you might want to look at http://www.baarf.com ""Stephane Tessier"" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > I think with your help guys I'll do it! > > I'm working on it! > > I'll work on theses issues: > > we have space for more ram(we use 2 gigs on possibility of 3 gigs) > iowait is very high 98% --> look like postgresql wait for io access > raid5 -->raid0 if i'm right raid5 use 4 writes(parity,data, etc) for each ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Inverted-list databases (was: Working on huge RAM based datasets)
""Andy Ballingall"" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > On another thread, (not in this mailing list), someone mentioned that there > are a class of databases which, rather than caching bits of database file > (be it in the OS buffer cache or the postmaster workspace), construct a a > well indexed memory representation of the entire data in the postmaster > workspace (or its equivalent), and this, remaining persistent, allows the DB > to service backend queries far quicker than if the postmaster was working > with the assumption that most of the data was on disk (even if, in practice, > large amounts or perhaps even all of it resides in OS cache). As a historical note, System R (grandaddy of all relational dbs) worked this way. And it worked under ridiculous memory constraints by modern standards. Space-conscious MOLAP databases do this, FWIW. Sybase 11 bitmap indexes pretty much amount to this, too. I've built a SQL engine that used bitmap indexes within B-Tree indexes, making it practical to index every field of every table (the purpose of the engine). You can also build special-purpose in-memory representations to test for existence (of a key), when you expect a lot of failures. Google "superimposed coding" e.g. http://www.dbcsoftware.com/dbcnews/NOV94.TXT ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [PERFORM] Query performance
Usually, when you post a request like this, you should provide something a little more concrete (the CREATE TABLE statement for that table, with its indexes). Since you didn't, I'll posit something that sounds like what you're using, and take a stab at your problem.

CREATE TABLE Prices (
    stock   VARCHAR(9)
   ,asof    DATE
   ,opening MONEY
   ,closing MONEY
   ,PRIMARY KEY (stock, asof)
);

SELECT stock, AVG((closing - opening) / opening) AS ratio
FROM Prices
GROUP BY stock
ORDER BY ratio DESC
LIMIT 10;  -- top 10 best-performing stocks

""Bill"" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]...
Actually, I have some queries that are slow, however I was wondering if you could help me write a query that is rather simple, but I, as a true database novice, can't seem to conjure. So we have stocks, as I have previously said, and I have a huge table which contains all of the opening and closing prices of some stocks from each day. What I'd like to do, in English, for each stock in each day is find a ratio: abs(closing-opening)/opening. Then I would like to average all of the ratios of each day of each individual stock together to find a final ratio for each stock, then I would like to find the highest average, to find the best performing stock. So what query can I use, and (as is appropriate for this group), how can it be optimized to run the fastest?
[PERFORM] Range query optimization
I'm trying to make a (qua-technical, qua-business) case for switching from MS SQL, and one of the types of query that really doesn't sit well with MS SQL2K is:

-- All fields integers or equivalent.
-- Table T(k, x: nonkey fields...)
-- Table U(k, a, z: m) -- for each value of (k), a set of non-intersecting ranges [a,z) that map to (m) values.
select T.*, U.m
from T join U on T.k = U.k and T.x >= U.a and T.x < U.z

Typically there are about 1000-2000 U rows per value of (k), about 100K values of (k) and about 50M values of T. By itself, this type of query grinds the CPU to dust. A clustered index on fields of U (take your pick) barely halves the problem of the loop through 1000-2000 rows of U for each row of T. Hash join likewise. The current workaround is a 'manual' radix index on top of the range table, but it's something of a hack. Would the geometric type extensions handle such queries efficiently? I'm not familiar with applying R-trees to linear range problems.
"Dreams come true, not free." -- S.Sondheim
---(end of broadcast)---
TIP 8: explain analyze is your friend
Re: [PERFORM] Major differences between oracle and postgres performance - what can I do ?
Don't know about Oracle, but select-distinct in MSSQL2K will indeed throw away duplicates, which chops the CPU time. Very easy to see in the graphic query plan, both in terms of CPU and the number of rows retrieved from a single-node or nested-loop subtree. Definitely a worthwhile optimization. "Tom Lane" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Stephen Frost <[EMAIL PROTECTED]> writes: > > * Tom Lane ([EMAIL PROTECTED]) wrote: > >> [... thinks for awhile ...] It seems possible that they may use sort > >> code that knows it is performing a DISTINCT operation and discards > >> duplicates on sight. Given that there are only 534 distinct values, > >> the sort would easily stay in memory if that were happening. > > > Could this optimization be added to PostgreSQL? It sounds like a very > > reasonable thing to do. > > That's what I was wondering about too. But first I'd like to get > some kind of reading on how effective it would be. If someone can > demonstrate that Oracle can do sort-and-drop-dups a lot faster than > it can do a straight sort of the same amount of input data, that > would be a strong indication that it's worth doing. At this point > we don't know if that's the source of their win or not. > > regards, tom lane > > ---(end of broadcast)--- > TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED] > ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]