[HACKERS] Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
Recently, when I was running my application on 8.3.7, my data got corrupted. The scene was like this:

invalid memory alloc request size

I investigated the error data, and found that one sector of a db-block had become all-zero (I confirmed the reason later: my disk had gone bad). I also checked the log of the postmaster, and I found that there were 453 ERROR messages saying "could not read block XXX of relation XXX", where XXX was the db-block that the bad sector resided in. After these 453 failed read operations, the postmaster's read succeeded, but got an all-zero sector! (I don't know why the operating system allowed this to happen, but it just happened.)

My question is: should not the mdxxx functions (e.g. mdread, mdwrite, mdsync) just report PANIC instead of ERROR when I/O fails? IMO, since the data has already been corrupted, reporting ERROR will just leave us a very curious scene later -- which does more harm than benefit.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
On Mon, Jun 15, 2009 at 04:41:42PM +0800, Jacky Leng wrote:

My question is: should not the mdxxx functions (e.g. mdread, mdwrite, mdsync) just report PANIC instead of ERROR when I/O fails? IMO, since the data has already been corrupted, reporting ERROR will just leave us a very curious scene later -- which does more harm than benefit.

I think the reasoning is that if those functions reported a PANIC the chance you could recover your data is zero, because you need the database system to read the other (good) data. With an ERROR you can investigate the problem and save what can be saved...

Have a nice day,
--
Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
Please line up in a tree and maintain the heap invariant while boarding. Thank you for flying nlogn airlines.

signature.asc Description: Digital signature
Re: [HACKERS] machine-readable explain output
I wrote:

On Sun, Jun 14, 2009 at 1:04 PM, Robert Haas robertmh...@gmail.com wrote:
On Sun, Jun 14, 2009 at 1:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Robert Haas robertmh...@gmail.com writes:
On Sun, Jun 14, 2009 at 11:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:

However, using BuildTupleFromCStrings is wasteful/stupid for *both* text and xml output, so it seems like getting rid of it is the thing to do here.

Makes sense. However, if we just make that change in do_tup_output(), then we'll break the ability to use that function for non-text datatypes.

I'd envision it taking Datums, so it doesn't really matter. However, as you say, specializing it to text only wouldn't be much of a loss.

I like the Datum option, so I'll work up a patch for that, unless you want to just do it and spare me the trouble. :-)

Here's an attempt. Is this anything like what you had in mind?

Hmm... on further review, I'm thinking this is still a bit wasteful, because we don't really need (I think) to call TupleDescGetAttInMetadata from begin_tup_output_tupdesc. But I'm not sure what the best way is to avoid that. Any thoughts?

...Robert
Re: [HACKERS] char() overhead on read-only workloads not so insignifcant as the docs claim it is...
Comments?

On Sat, Jun 13, 2009 at 3:44 PM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote:

I'm currently doing some benchmarking on a Nehalem box (http://www.kaltenbrunner.cc/blog/index.php?/archives/26-Benchmarking-8.4-Chapter-1Read-Only-workloads.html) with 8.4, and while investigating what looks like issues in pgbench I also noticed that char() has a more than negligible overhead on some (very special) read-only(!) workloads. For example, running sysbench in read-only mode against 8.4 results in a profile (for the full run) that looks similar to:

samples  %        symbol name
981690   11.0656  bcTruelen
359183    4.0487  index_getnext
311128    3.5070  AllocSetAlloc
272330    3.0697  hash_search_with_hash_value
258157    2.9099  LWLockAcquire
195673    2.2056  _bt_compare
190303    2.1451  slot_deform_tuple
168101    1.8948  PostgresMain
164191    1.8508  _bt_checkkeys
126110    1.4215  FunctionCall2
123965    1.3973  SearchCatCache
120629    1.3597  LWLockRelease

The default sysbench mode actually uses a number of different queries, and the ones dealing with char() are actually only a small part of the full set of queries sent. The specific query that is causing bcTruelen to show up in the profile is:

SELECT c from sbtest where id between $1 and $2 order by c

where the parameters are for example $1 = '5009559', $2 = '5009658' - ie. ranges of 100.
Benchmarking only that query results in:

samples  %        symbol name
2148182  23.5861  bcTruelen
369463    4.0565  index_getnext
362784    3.9832  AllocSetAlloc
284198    3.1204  slot_deform_tuple
185279    2.0343  _bt_checkkeys
180119    1.9776  LWLockAcquire
172733    1.8965  appendBinaryStringInfo
144158    1.5828  internal_putbytes
141040    1.5486  AllocSetFree
138093    1.5162  printtup
124255    1.3643  hash_search_with_hash_value
117054    1.2852  heap_form_minimal_tuple

at around 46000 queries/s. Changing the default sysbench schema from:

         Table "public.sbtest"
 Column |      Type      | Modifiers
--------+----------------+------------------------------------------------------
 id     | integer        | not null default nextval('sbtest_id_seq'::regclass)
 k      | integer        | not null default 0
 c      | character(120) | not null default ''::bpchar
 pad    | character(60)  | not null default ''::bpchar
Indexes:
    "sbtest_pkey" PRIMARY KEY, btree (id)
    "k" btree (k)

to:

         Table "public.sbtest"
 Column |       Type        | Modifiers
--------+-------------------+------------------------------------------------------
 id     | integer           | not null default nextval('sbtest_id_seq'::regclass)
 k      | integer           | not null default 0
 c      | character varying | not null default ''::character varying
 pad    | character(60)     | not null default ''::bpchar
Indexes:
    "sbtest_pkey" PRIMARY KEY, btree (id)
    "k" btree (k)

results in a near 50%(!) speedup in terms of tps, to around 67000 queries/s. This is however an extreme case, because the c column actually contains no data at all (except for an empty string). The profile for the changed testcase looks like:

samples  %        symbol name
430797    5.      index_getnext
396750    4.8095  AllocSetAlloc
345508    4.1883  slot_deform_tuple
228222    2.7666  appendBinaryStringInfo
227766    2.7610  _bt_checkkeys
193818    2.3495  LWLockAcquire
179925    2.1811  internal_putbytes
168871    2.0471  printtup
152026    1.8429  AllocSetFree
146333    1.7739  heap_form_minimal_tuple
144305    1.7493  FunctionCall2
128320    1.      hash_search_with_hash_value

At the very least we should reconsider this part of our docs:

"There is no performance difference between these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column."
from http://www.postgresql.org/docs/8.4/static/datatype-character.html

regards
Stefan

--
Lets call it Postgres
EnterpriseDB http://www.enterprisedb.com
gurjeet[.sin...@enterprisedb.com
singh.gurj...@{ gmail | hotmail | indiatimes | yahoo }.com
Mail sent from my BlackLaptop device
Re: [HACKERS] machine-readable explain output
Robert Haas robertmh...@gmail.com writes:

Hmm... on further review, I'm thinking this is still a bit wasteful, because we don't really need (I think) to call TupleDescGetAttInMetadata from begin_tup_output_tupdesc. But I'm not sure what the best way is to avoid that. Any thoughts?

Er, just don't do it? We shouldn't need it if the function is doing heap_form_tuple directly.

regards, tom lane
Re: [HACKERS] Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
Martijn van Oosterhout klep...@svana.org writes:

On Mon, Jun 15, 2009 at 04:41:42PM +0800, Jacky Leng wrote:
My question is: should not the mdxxx functions (e.g. mdread, mdwrite, mdsync) just report PANIC instead of ERROR when I/O fails? IMO, since the data has already been corrupted, reporting ERROR will just leave us a very curious scene later -- which does more harm than benefit.

I think the reasoning is that if those functions reported a PANIC the chance you could recover your data is zero, because you need the database system to read the other (good) data.

Also, in the case you're complaining about, the problem was that there wasn't any O/S error report that we could have PANIC'd about anyhow. But Martijn is correct that a PANIC here would reduce the system's overall stability without any clear benefit. We already do refuse to read a page into shared buffers if there's a read error on it, so it's not clear to me how you think that an ERROR leaves things in an unstable state.

regards, tom lane
Re: [HACKERS] Suppressing occasional failures in copy2 regression test
On Sun, Jun 14, 2009 at 12:39 AM, Robert Haas robertmh...@gmail.com wrote:
On Sat, Jun 13, 2009 at 2:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Greg Stark greg.st...@enterprisedb.com writes:

I'm not sure about that. It seems like race conditions with autovacuum are a real potential bug that it would be nice to be testing for.

It's not a bug; it's a limitation of our testing framework that it sees this as a failure. Serious testing for autovac race conditions would indeed be interesting, but you're never going to get anything meaningful in that direction out of the current framework.

The elephant in the room here may be moving to some more flexible/powerful testing framework, but the difficulty will almost certainly be in agreeing what it should look like. The actual writing of said test framework will take some work too, but to some degree that's a SMOP. This tuple-ordering issue seems to be one that comes up over and over again, but in the short term, making it a TEMP table seems like a reasonable fix.

I am forwarding a mail containing a perl script and a pair of sample files that I developed about a year ago. The forwarded mail text explains what the script is trying to do. A line beginning with '?' in the expected file is treated specially: if a line begins with '?' then the rest of the line is treated as a regular expression which will be used to match the corresponding line from the actual output. If '?' is immediately followed by the word 'unordered', all the lines until a line containing '?/unordered' are buffered and compared against the corresponding lines from the result file, ignoring the order of the result lines.

Although we at EnterpriseDB have resolved the issues by alternate files etc., and do not use this script, I think it might be useful for community regression tests.
Best regards,

-- Forwarded message --
From: Gurjeet Singh gurjeet.si...@enterprisedb.com
Date: Fri, Aug 8, 2008 at 1:45 AM
Subject: neurodiff: a new diff utility for our regression test suites

Hi All,

PFA a perl script that implements a new kind of comparison, which might help us in situations like the one we encountered recently with differing plan costs in the hints patch. This script implements two new kinds of comparisons: i) regular expression (RE) based comparison, and ii) comparison of unordered groups of lines.

The input for this script, just like regular diff, is two files: one the expected output and one the actual output. The lines in the expected output file which are expected to have any kind of variability should start with a '?' character followed by an RE that the line should match. For example, if we wish to compare a line of EXPLAIN output that has the cost component too, it might look like:

?Index Scan using accounts_i1 on accounts \(cost=\d+\.\d+\.\.\d+\.\d+ rows=\d+ width=\d+\)

The above RE would help us match any line that matches the pattern, such as:

Index Scan using accounts_i1 on accounts (cost=0.00..8.28 rows=1 width=106)

or

Index Scan using accounts_i1 on accounts (cost=1000...2000.20008 rows=1 width=1000)

Apart from this, the SQL standard does not guarantee any order of results unless the query has an explicit ORDER BY clause. We often encounter cases in our result files where the output differs from the expected only in the order of the result. To bypass this effect, and to keep the 'diff' quiet, I have seen people invariably add an ORDER BY clause to the query and modify the expected file accordingly. There is a remote possibility of the ORDER BY clause masking an issue/bug that would have otherwise shown up in the diffs or might have caused the crash. Using this script we can put special markers in the expected output that denote the boundaries of a set of lines that are expected to be produced in an undefined order.
The script would not complain unless there's an actual missing or extra line in the output. Suppose that we have the following result-set to compare:

4 | JACK
5 | CATHY
2 | SCOTT
1 | KING
3 | MILLER

The expected file would look like this:

?unordered
1 | KING
2 | SCOTT
?\d \| MILLER
4 | JACK
5 | CATHY
?/unordered

This expected file will succeed for both of the following variations of the result-set too:

5 | CATHY
4 | JACK
3 | MILLER
2 | SCOTT
1 | KING

or

1 | KING
4 | JACK
3 | MILLER
2 | SCOTT
5 | CATHY

Also, as shown in the above example, the RE-based matching works for the lines within the 'unordered' set too. The beauty of this approach for testing pattern matches and unordered results is that we don't have to modify the test cases in any way; we just need to make adjustments in the expected output files. I am no perl guru, so I definitely see a lot of performance/semantic improvements possible (perl gurus, take a stab); and maybe that's the reason the script looks more like a C program than a whacky perl script full of ~!$^ and whatnot. This script
Re: [HACKERS] Suppressing occasional failures in copy2 regression test
On Mon, Jun 15, 2009 at 07:41:29PM +0530, Gurjeet Singh wrote:
On Sun, Jun 14, 2009 at 12:39 AM, Robert Haas robertmh...@gmail.com wrote:
Greg Stark greg.st...@enterprisedb.com writes:

I'm not sure about that. It seems like race conditions with autovacuum are a real potential bug that it would be nice to be testing for.

It's not a bug; it's a limitation of our testing framework that it sees this as a failure. Serious testing for autovac race conditions would indeed be interesting, but you're never going to get anything meaningful in that direction out of the current framework.

The elephant in the room here may be moving to some more flexible/powerful testing framework, but the difficulty will almost certainly be in agreeing what it should look like. The actual writing of said test framework will take some work too, but to some degree that's a SMOP. This tuple-ordering issue seems to be one that comes up over and over again, but in the short term, making it a TEMP table seems like a reasonable fix.

I am forwarding a mail containing a perl script and a pair of sample files that I developed about a year ago. The forwarded mail text explains what the script is trying to do. A line beginning with '?' in the expected file is treated specially: if a line begins with '?' then the rest of the line is treated as a regular expression which will be used to match the corresponding line from the actual output. If '?' is immediately followed by the word 'unordered', all the lines until a line containing '?/unordered' are buffered and compared against the corresponding lines from the result file, ignoring the order of the result lines. Although we at EnterpriseDB have resolved the issues by alternate files etc., and do not use this script, I think it might be useful for community regression tests.

I looked over the program. Here are a few observations:

* use strict; Good!
* Fails perlcritic http://search.cpan.org/~elliotjs/Perl-Critic-1.098/bin/perlcritic at level 5: Bad!
* It's not using Tie::File http://perldoc.perl.org/Tie/File.html. Probably bad.
* It's going to a lot of trouble to allow for the possibility of both unordered results and of duplicate lines. If we disallow duplicate lines in unordered result sets, we can get a big speed gain by using hash-based comparisons.
* Thanks, Gurjeet! :)

Cheers, David.
--
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com
Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] Suppressing occasional failures in copy2 regression test
David Fetter da...@fetter.org writes:

* It's going to a lot of trouble to allow for the possibility of both unordered results and of duplicate lines. If we disallow duplicate lines in unordered result sets, we can get a big speed gain by using hash-based comparisons.

Why not just sort the lines and compare?

regards, tom lane
Re: [HACKERS] Suppressing occasional failures in copy2 regression test
On Mon, Jun 15, 2009 at 11:34:38AM -0400, Tom Lane wrote:
David Fetter da...@fetter.org writes:
* It's going to a lot of trouble to allow for the possibility of both unordered results and of duplicate lines. If we disallow duplicate lines in unordered result sets, we can get a big speed gain by using hash-based comparisons.

Why not just sort the lines and compare?

Good point :)

Cheers, David.
--
David Fetter da...@fetter.org http://fetter.org/
Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] machine-readable explain output
On Mon, Jun 15, 2009 at 9:51 AM, Tom Lane t...@sss.pgh.pa.us wrote:
Robert Haas robertmh...@gmail.com writes:

Hmm... on further review, I'm thinking this is still a bit wasteful, because we don't really need (I think) to call TupleDescGetAttInMetadata from begin_tup_output_tupdesc. But I'm not sure what the best way is to avoid that. Any thoughts?

Er, just don't do it? We shouldn't need it if the function is doing heap_form_tuple directly.

Oh, I guess that works. I had thought there might be people calling begin_tup_output_tupdesc() who wanted to go on to call BuildTupleFromCStrings(), but it seems that's not the case. In fact, it looks like I can probably rip that member out of TupOutputState altogether. Will update patch. Does this look like what you were thinking otherwise?

Thanks,

...Robert
Re: [HACKERS] Modifying TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET?
I wrote:
Proposed patch attached.

That first version was of the minimally invasive variety, to stress how little I was changing and minimize the chance that I would make some dumb error; however, it involved copy/paste of a few lines which were already in a source file twice. Attached is what I hope is a functionally identical patch, with minor refactoring. I think it results in more readable code.

-Kevin

toast-main-out-of-line-reluctantly-2.diff Description: Binary data
Re: [HACKERS] machine-readable explain output
Robert Haas robertmh...@gmail.com writes:

it looks like I can probably rip that member out of TupOutputState altogether. Will update patch. Does this look like what you were thinking otherwise?

Yeah, that's exactly what I was thinking.

regards, tom lane
Re: [HACKERS] postmaster recovery and automatic restart suppression
Hi,

-----Original Message-----
From: ext Tom Lane [mailto:t...@sss.pgh.pa.us]
Sent: Tuesday, June 09, 2009 9:20 PM
To: Kolb, Harald (NSN - DE/Munich)
Cc: Robert Haas; Greg Stark; Simon Riggs; Fujii Masao; pgsql-hackers@postgresql.org; Czichy, Thoralf (NSN - FI/Helsinki)
Subject: Re: [HACKERS] postmaster recovery and automatic restart suppression

Kolb, Harald (NSN - DE/Munich) harald.k...@nsn.com writes:
If you don't want to see this option as a GUC parameter, would it be acceptable to have it as a new postmaster cmd line option?

That would make two kluges, not one (we don't do options that are settable in only one way). And it does nothing whatever to address my objection to the concept.

regards, tom lane

The first point is understood. The second point needs further discussion:

The recovery and restart feature is an excellent solution if the db is running in a standalone environment, and I understand that this should not be weakened. But in a configuration where the db is only one resource among others and where you have a central supervisor, it's problematic. There, this central instance observes all the resources and services and decides what to do in case of problems. It's not up to the resource/service to make its own decision, because it's only a piece of the cake and doesn't have the complete view of the whole situation. E.g. the behaviour might be different if the problems occur during an overload situation, or if you already have hints of HW-related problems, or if you are in an upgrade procedure and the initial start fails. An uncontrolled and undetected automatic restart may complicate the situation and increase the outage time. Thus it would be helpful to have the possibility of very fast failure detection (SIGCHLD in the controlling instance) and to avoid wasteful cleanup procedures. If the db is embedded in a management (High Availability) environment, this option will be helpful in general, independent of whether you have a cluster or a single node.
But in a cluster environment it would be even more important to have this switch, because you will always have this management instance, the cluster software. And of course the main reason for a cluster is to switch over when it makes sense to do so. And one good reason to really do it is when a central instance like the db on the primary side crashes. At least the user should have the possibility to decide this, but this would require that PostgreSQL constructively support this situation.

Regards, Harald.
Re: [HACKERS] postmaster recovery and automatic restart suppression
Kolb, Harald (NSN - DE/Munich) wrote:

The recovery and restart feature is an excellent solution if the db is running in a standalone environment, and I understand that this should not be weakened. But in a configuration where the db is only one resource among others and where you have a central supervisor, it's problematic. There, this central instance observes all the resources and services and decides what to do in case of problems. It's not up to the resource/service to make its own decision, because it's only a piece of the cake and doesn't have the complete view of the whole situation.

Surely you can just stop the postmaster while it's in recovery from the supervisor when you detect this.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
[HACKERS] What does log_destination = csvlog mean?
In reading through our documentation, I am unclear how log_destination = csvlog works. It seems to me that 'csvlog' is a format-output type, not a real destination, or rather it is a special output format for stderr. Is this accurate? I would like to clarify our documentation.

--
Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] What does log_destination = csvlog mean?
Bruce Momjian wrote:

In reading through our documentation, I am unclear how log_destination = csvlog works. It seems to me that 'csvlog' is a format-output type, not a real destination, or rather it is a special output format for stderr. Is this accurate? I would like to clarify our documentation.

CSV logs can in fact only be delivered via redirected stderr, i.e. csvlog requires that logging_collector be on. So in a sense it's both a format and a destination. There is a strong technical reason for that, namely that only by doing that can we be sure that CSV logs won't get lines multiplexed, which would make loading them back into a table impossible. We invented a whole (simple) protocol between the backends and the syslogger just to handle that.

cheers

andrew
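For reference, a minimal postgresql.conf sketch of the setup Andrew describes (parameter names as in 8.4; the directory value is just an example):

```
logging_collector = on              # csvlog is delivered via the syslogger,
                                    # so the collector must be running
log_destination = 'stderr,csvlog'   # emit both plain-text and CSV logs
log_directory = 'pg_log'            # *.csv files appear here
```

With logging_collector off, a csvlog entry in log_destination is rejected at startup rather than silently ignored.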
Re: [HACKERS] What does log_destination = csvlog mean?
Andrew Dunstan wrote:
Bruce Momjian wrote:

In reading through our documentation, I am unclear how log_destination = csvlog works. It seems to me that 'csvlog' is a format-output type, not a real destination, or rather it is a special output format for stderr. Is this accurate? I would like to clarify our documentation.

CSV logs can in fact only be delivered via redirected stderr, i.e. csvlog requires that logging_collector be on. So in a sense it's both a format and a destination. There is a strong technical reason for that, namely that only by doing that can we be sure that CSV logs won't get lines multiplexed, which would make loading them back into a table impossible. We invented a whole (simple) protocol between the backends and the syslogger just to handle that.

That's what I thought; thanks.

--
Bruce Momjian br...@momjian.us http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] char() overhead on read-only workloads not so insignifcant as the docs claim it is...
On Sat, Jun 13, 2009 at 3:44 PM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote:

The specific query that is causing bcTruelen to show up in the profile is:

SELECT c from sbtest where id between $1 and $2 order by c

where the parameters are for example $1 = '5009559', $2 = '5009658' - ie. ranges of 100. Benchmarking only that query results in:

samples  %        symbol name
2148182  23.5861  bcTruelen
369463    4.0565  index_getnext
362784    3.9832  AllocSetAlloc

Gurjeet Singh wrote:
Comments?

Maybe bcTruelen could be optimized to step one word at a time (perhaps by using XOR against a precomputed word filled with ' '), instead of one byte at a time ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] char() overhead on read-only workloads not so insignifcant as the docs claim it is...
Alvaro,

Maybe bcTruelen could be optimized to step one word at a time (perhaps by using XOR against a precomputed word filled with ' '), instead of one byte at a time ...

I have a patch for this, will send soon.

Regards,

Jeremy
[HACKERS] [PATCH] backend: compare word-at-a-time in bcTruelen
Signed-off-by: Jeremy Kerr j...@ozlabs.org

---
 src/backend/utils/adt/varchar.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5f3c658..6889dff 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -624,16 +624,34 @@ varchartypmodout(PG_FUNCTION_ARGS)
 static int
 bcTruelen(BpChar *arg)
 {
+	const unsigned int spaces = 0x20202020;
+	const int	wordsize = sizeof(spaces);
 	char	   *s = VARDATA_ANY(arg);
 	int			i;
-	int			len;
 
-	len = VARSIZE_ANY_EXHDR(arg);
-	for (i = len - 1; i >= 0; i--)
+	i = VARSIZE_ANY_EXHDR(arg) - 1;
+
+	/* compare down to an aligned boundary */
+	for (; i >= 0 && i % wordsize != wordsize - 1; i--)
 	{
 		if (s[i] != ' ')
+			return i + 1;
+	}
+
+	/* now that we're aligned, compare word at a time */
+	for (; i >= wordsize - 1; i -= wordsize)
+	{
+		if (*(unsigned int *) (s + i - (wordsize - 1)) != spaces)
 			break;
 	}
+
+	/* check within the last non-matching word */
+	for (; i >= 0; i--)
+	{
+		if (s[i] != ' ')
+			break;
+	}
+
 	return i + 1;
 }
Re: [HACKERS] [PATCH] backend: compare word-at-a-time in bcTruelen
On Jun 15, 2009, at 9:04 PM, Jeremy Kerr j...@ozlabs.org wrote:

Signed-off-by: Jeremy Kerr j...@ozlabs.org
---
 src/backend/utils/adt/varchar.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/varchar.c b/src/backend/utils/adt/varchar.c
index 5f3c658..6889dff 100644
--- a/src/backend/utils/adt/varchar.c
+++ b/src/backend/utils/adt/varchar.c
@@ -624,16 +624,34 @@ varchartypmodout(PG_FUNCTION_ARGS)
 static int
 bcTruelen(BpChar *arg)
 {
+	const unsigned int spaces = 0x20202020;
+	const int	wordsize = sizeof(spaces);

This looks very non-portable to me.

...Robert
Re: [HACKERS] [PATCH] backend: compare word-at-a-time in bcTruelen
Robert,

This looks very non-portable to me.

Unsurprisingly, I'm new to postgres hacking and the large number of supported platforms :)

I was considering something like:

	unsigned int spaces;
	const unsigned int wordsize = sizeof(unsigned int);

	memset(&spaces, ' ', wordsize);

In most cases, the compiler should be able to optimise the memset out, but it may introduce overhead where this is not possible. However, are there any supported platforms where sizeof(unsigned int) != 4 ?

Cheers,

Jeremy
Re: [HACKERS] [PATCH] backend: compare word-at-a-time in bcTruelen
Jeremy,

* Jeremy Kerr (j...@ozlabs.org) wrote:
Signed-off-by: Jeremy Kerr j...@ozlabs.org
---
 src/backend/utils/adt/varchar.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

Thanks for the contribution. A couple of comments:

The documentation for submitting a patch to PostgreSQL is here: http://wiki.postgresql.org/wiki/Submitting_a_Patch

There is also a Developer FAQ available here: http://wiki.postgresql.org/wiki/Developer_FAQ

The PostgreSQL core folks prefer context diffs (it's not my preference, but I'm not part of core, nor am I a committer :). There are a number of things requested to be included with a patch, but in particular I would point out:

* Which CVS branch the patch is against (ordinarily this will be HEAD).
* Whether it compiles and tests successfully, so we know nothing obvious is broken.
* Whether it contains any platform-specific items and, if so, whether it has been tested on other platforms.
* Describe the effect your patch has on performance, if any. If the patch is intended to improve performance, it's a good idea to include some reproducible tests to demonstrate the improvement.

You might check out sections 3 & 6 of src/include/c.h. Section 3 defines standard system types, while section 6 defines some widely useful macros; in particular our custom MemSet and MemSetAligned, which work on aligned memory structures for improved performance.

Thanks,

Stephen

signature.asc Description: Digital signature
Re: [HACKERS] Should mdxxx functions(e.g. mdread, mdwrite, mdsync etc) PANIC instead of ERROR when I/O failed?
I think the reasoning is that if those functions reported a PANIC the chance you could recover your data is zero, because you need the database system to read the other (good) data.

I do not see why PANIC reduces the chance to recover my data. AFAICS, my data has already been corrupted (because of the bad block), so whether we PANIC or not, a read of the bad block should give the same result.

Also, in the case you're complaining about, the problem was that there wasn't any O/S error report that we could have PANIC'd about anyhow.

No, the O/S did report the error, which led to the 453 ERROR messages from postgres. The O/S error messages (obtained with dmesg) look like this:

end_request: I/O error, dev sda, sector 504342711
ata1: EH complete
SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata1.00: (irq_stat 0x4008)
ata1.00: cmd 60/08:00:b0:a8:0f/00:00:1e:00:00/40 tag 0 cdb 0x0 data 4096 in
         res 41/40:08:b7:a8:0f/06:00:1e:00:00/00 Emask 0x9 (media error)
ata1.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168
ata1.00: ata_hpa_resize 1: sectors = 976773168, hpa_sectors = 976773168

We already do refuse to read a page into shared buffers if there's a read error on it, so it's not clear to me how you think that an ERROR leaves things in an unstable state.

In my case, it seems that the O/S does not guarantee that if an I/O operation (read, write, sync, etc.) on a block fails, all later I/O operations on that block will also fail. For example:

1. As I noted before, although the bad db-block had been read unsuccessfully 453 times, the 454th read succeeded (but some of its data, the bad sector, had been set to all-zero). So even though the 453 failed I/Os reported ERROR, there was still a chance for the bad db-block to be read into shared buffers.
2. 
Besides, I have noticed a case like this: 1) An mdsync operation failed with the message ERROR: could not fsync segment XXX of relation XXX: ??. The O/S error message (obtained with dmesg) was like this:

Buffer I/O error on device ^A#63733;XX205503, logical block 43837786
lost page write due to I/O error on ^A#63733;XX205503

2) This left a half-written db-block in my data. But the page could still be read into shared buffers successfully later, which led to a curious downstream failure: ERROR: could not access status of transaction X
[HACKERS] How to embed postgresql?
Dears, How to embed postgresql if possible? In our Java application solution, we need an embedded database. Berkeley DB cannot meet our requirements in performance and license, so we wish to port postgresql as an embedded database. Architecture: Java -> JNI -> an API wrapped from the Plan/Executor modules. What's the feasibility? If it's possible, we look forward to your coaching! Many thanks! Bruce
[HACKERS] question about meaning of character varying without length
Hello, I've found the following description: If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension. Does this mean that character varying without length is equivalent to the text type? Are there any differences? I noticed that the ODBC driver processes the type differently from text. Please help! Konstantin
Re: [HACKERS] [PATCH] backend: compare word-at-a-time in bcTruelen
On Mon, Jun 15, 2009 at 9:51 PM, Jeremy Kerr j...@ozlabs.org wrote: I was considering something like: unsigned int spaces; const unsigned int wordsize = sizeof(unsigned int); memset(&spaces, ' ', wordsize); In most cases, the compiler should be able to optimise the memset out, but it may introduce overhead where this is not possible.

What about just: static char spaces[4] = { ' ', ' ', ' ', ' ' }; and then ... * (uint32 *) spaces? There's not much point taking the length of the word when you've initialized it to contain exactly 4 bytes. What you want to do is pick the flavor of integer that will be the same length as what you've initialized, and it turns out we already have that (see src/include/c.h).

As I look at this, another problem is that it seems to me that you're assuming that VARDATA_ANY() will return an aligned pointer, which isn't necessarily the case (see src/include/postgres.h).

The advice in Stephen's email is also very good - in particular, whatever you come up with, you should submit performance results. Note that while --enable-profiling is very useful and profiling numbers are good to submit, you'll also want to make sure you do a build that is optimized for speed (i.e. no profiling, no casserts, no debug) and do timings on that.

...Robert
Re: [HACKERS] question about meaning of character varying without length
On Tue, Jun 16, 2009 at 12:24 AM, Konstantin Izmailov pgf...@gmail.com wrote: Hello, I've found following description: If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension. Does this mean that character varying without length is equivalent to text type. Are there any differences? I noticed that ODBC driver processes the type differently from text. Please help! Konstantin This question would be more appropriate for pgsql-general or maybe pgsql-odbc, since it is not a question about the development of PostgreSQL. character varying and text are different types, but there's no real difference between them. I can't speak to what the ODBC driver does. ...Robert
Re: [HACKERS] How to embed postgresql?
On Tue, Jun 16, 2009 at 12:13 AM, Bruce YUAN sua...@gmail.com wrote: How to embed postgresql if possible? See http://wiki.postgresql.org/wiki/Todo#Features_We_Do_Not_Want ...Robert
Re: [HACKERS] question about meaning of character varying without length
I have tried to send to pgsql-general twice, each time it returns error: Relay access denied (state 14). Will try to post to pgsql-odbc. Thank you! On Mon, Jun 15, 2009 at 9:29 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 16, 2009 at 12:24 AM, Konstantin Izmailovpgf...@gmail.com wrote: Hello, I've found following description: If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension. Does this mean that character varying without length is equivalent to text type. Are there any differences? I noticed that ODBC driver processes the type differently from text. Please help! Konstantin This question would be more appropriate for pgsql-general or maybe pgsql-odbc, since it is not a question about the development of PostgreSQL. character varying and text are different types, but there's no real difference between them. I can't speak to what the ODBC driver does. ...Robert
Re: [HACKERS] question about meaning of character varying without length
On Tue, Jun 16, 2009 at 12:45 AM, Konstantin Izmailov pgf...@gmail.com wrote: I have tried to send to pgsql-general twice, each time it returns error: Relay access denied (state 14). Will try to post to pgsql-odbc. Could you provide the full bounce message? ...Robert
Re: [HACKERS] Patch for automating partitions in PostgreSQL 8.4 Beta 2
Hi Nikhil, I am sorry for the late reply. :( Please find my comments inline. On Tue, Jun 9, 2009 at 2:54 PM, Nikhil Sontakke nikhil.sonta...@enterprisedb.com wrote: Hi, The patch automates table partitioning to support Range and Hash partitions. Please refer to attached readme file for further details. The syntax used conforms to most of the suggestions mentioned in http://archives.postgresql.org/pgsql-hackers/2008-01/msg00413.php, barring the following: -- Specification of partition names is optional. System will be able to generate partition names in such cases. -- Sub partitioning Some comments based on a brief glance of the patch: - The logic to execute the partition triggers last still needs some more work IMHO. Relying on just the names might not get accepted. I think you should pay attention to Andrew Dunstan's suggestion in an earlier mail to have tgkind enumerations to generalize the same or discuss it further. the scheme should turn tgisconstraint into a multi-valued item (tgkind: 'u' = userland, 'c' = constraint, 'p' = partition or some such). Kedar I am working on turning tgisconstraint into a multi-valued item to categorize user, constraint and partition triggers. In doing so, I am thinking of adding a 'PARTITION' keyword to the existing syntax to create partition triggers, i.e. CREATE PARTITION TRIGGER ... Partition triggers would then be identified by their type rather than by naming scheme, though the naming scheme for partition triggers would still be there. - Similarly, assigning of_relname_oid names to overflow tables also might not work. The best way ahead could be to invent a new relkind RELKIND_OVERFLOW to handle it. Or maybe we can have a new schema pg_overflow to store the overflow relation with the same name (suffixed with _overflow to make it clearer) as the parent relation too. The relkind solution might be cleaner though. This might need further discussion. 
In general, it is definitely not a bad idea to discuss such sub-problems on the list :) Kedar I will look at it once done with categorizing partition triggers. I am inclined toward using a new relkind like RELKIND_OVERFLOW. - Am I reading the patch correctly that you do not end up creating indexes on the children tables? That is a big problem! Kedar As Srinath K is working on global indexes, the merge of partitioning and global indexes should be able to extend indexes created on a partitioned table to its children as well. - You can remove the remnants of the first patch like the MutateColumnRefs() function, for example (I agree this is WIP, but unwanted/unused functions unnecessarily add to the size). With large patches, the more precise the patch, the better it will be for reviewers/readers. Kedar Yeah. Great work all in all! Kedar Thanks! Looking forward to your continued co-operation. Regards, Nikhils -- http://www.enterprisedb.com
Re: [HACKERS] question about meaning of character varying without length
Here you go:

from: Mail Delivery Subsystem mailer-dae...@googlemail.com
to: pgf...@gmail.com
date: Mon, Jun 15, 2009 at 9:16 PM
subject: Delivery Status Notification (Failure)

This is an automatically generated Delivery Status Notification Delivery to the following recipient failed permanently: pgsql-gene...@postgresql.com Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 554 554 5.7.1 pgsql-gene...@postgresql.com: Relay access denied (state 14). - Original message - MIME-Version: 1.0 Received: by 10.142.237.19 with SMTP id k19mr3677653wfh.31.1245125814649; Mon, 15 Jun 2009 21:16:54 -0700 (PDT) Date: Mon, 15 Jun 2009 21:16:54 -0700 Message-ID: 72746b5e0906152116t13e67918l3eb58b5c304fd...@mail.gmail.com Subject: question about meaning of character varying without length From: Konstantin Izmailov pgf...@gmail.com To: pgsql-gene...@postgresql.com Content-Type: multipart/alternative; boundary=000e0cd24232e205f2046c6f7190 --000e0cd24232e205f2046c6f7190 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Hello, I've found following description: If character varying is used without length specifier, the type accepts strings of any size. The latter is a PostgreSQL extension. Does this mean that character varying without length is equivalent to - Message truncated - On Mon, Jun 15, 2009 at 9:47 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Jun 16, 2009 at 12:45 AM, Konstantin Izmailov pgf...@gmail.com wrote: I have tried to send to pgsql-general twice, each time it returns error: Relay access denied (state 14). Will try to post to pgsql-odbc. Could you provide the full bounce message? ...Robert
Re: [HACKERS] change do_tup_output to take Datum arguments rather than cstring
On Mon, Jun 15, 2009 at 1:20 PM, Tom Lanet...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: it looks like I can probably rip that member out of TupOutputState altogether. Will update patch. Does this look like what you were thinking otherwise? Yeah, that's exactly what I was thinking. Excellent. Revised patch attached. ...Robert *** a/src/backend/executor/execTuples.c --- b/src/backend/executor/execTuples.c *** *** 95,100 --- 95,101 #include catalog/pg_type.h #include nodes/nodeFuncs.h #include storage/bufmgr.h + #include utils/builtins.h #include utils/lsyscache.h #include utils/typcache.h *** *** 1204,1210 begin_tup_output_tupdesc(DestReceiver *dest, TupleDesc tupdesc) tstate = (TupOutputState *) palloc(sizeof(TupOutputState)); ! tstate-metadata = TupleDescGetAttInMetadata(tupdesc); tstate-slot = MakeSingleTupleTableSlot(tupdesc); tstate-dest = dest; --- 1205,1211 tstate = (TupOutputState *) palloc(sizeof(TupOutputState)); ! tstate-tupdesc = tupdesc; tstate-slot = MakeSingleTupleTableSlot(tupdesc); tstate-dest = dest; *** *** 1216,1232 begin_tup_output_tupdesc(DestReceiver *dest, TupleDesc tupdesc) /* * write a single tuple * - * values is a list of the external C string representations of the values - * to be projected. - * * XXX This could be made more efficient, since in reality we probably only * need a virtual tuple. */ void ! do_tup_output(TupOutputState *tstate, char **values) { ! /* build a tuple from the input strings using the tupdesc */ ! HeapTuple tuple = BuildTupleFromCStrings(tstate-metadata, values); /* put it in a slot */ ExecStoreTuple(tuple, tstate-slot, InvalidBuffer, true); --- 1217,1233 /* * write a single tuple * * XXX This could be made more efficient, since in reality we probably only * need a virtual tuple. */ void ! do_tup_output(TupOutputState *tstate, Datum *values, bool *isnull) { ! TupleDesc tupdesc = tstate-tupdesc; ! HeapTuple tuple; ! ! /* Form a tuple. */ ! 
tuple = heap_form_tuple(tupdesc, values, isnull); /* put it in a slot */ ExecStoreTuple(tuple, tstate-slot, InvalidBuffer, true); *** *** 1241,1264 do_tup_output(TupOutputState *tstate, char **values) /* * write a chunk of text, breaking at newline characters * - * NB: scribbles on its input! - * * Should only be used with a single-TEXT-attribute tupdesc. */ void do_text_output_multiline(TupOutputState *tstate, char *text) { while (*text) { char *eol; eol = strchr(text, '\n'); if (eol) ! *eol++ = '\0'; else ! eol = text +strlen(text); ! do_tup_output(tstate, text); text = eol; } } --- 1242,1275 /* * write a chunk of text, breaking at newline characters * * Should only be used with a single-TEXT-attribute tupdesc. */ void do_text_output_multiline(TupOutputState *tstate, char *text) { + Datum values[1]; + bool isnull[1] = { false }; + while (*text) { char *eol; + int len; eol = strchr(text, '\n'); if (eol) ! { ! len = eol - text; ! ++eol; ! } else ! { ! len = strlen(text); ! eol += len; ! } ! values[0] = PointerGetDatum(cstring_to_text_with_len(text, len)); ! do_tup_output(tstate, values, isnull); ! pfree(DatumGetPointer(values[0])); text = eol; } } *** *** 1269,1274 end_tup_output(TupOutputState *tstate) (*tstate-dest-rShutdown) (tstate-dest); /* note that destroying the dest is not ours to do */ ExecDropSingleTupleTableSlot(tstate-slot); - /* XXX worth cleaning up the attinmetadata? */ pfree(tstate); } --- 1280,1284 *** a/src/backend/utils/misc/guc.c --- b/src/backend/utils/misc/guc.c *** *** 5978,5984 ShowAllGUCConfig(DestReceiver *dest) int i; TupOutputState *tstate; TupleDesc tupdesc; ! char *values[3]; /* need a tuple descriptor representing three TEXT columns */ tupdesc = CreateTemplateTupleDesc(3, false); --- 5978,5985 int i; TupOutputState *tstate; TupleDesc tupdesc; ! Datum values[3]; ! 
bool isnull[3] = { false, false, false }; /* need a tuple descriptor representing three TEXT columns */ tupdesc = CreateTemplateTupleDesc(3, false); *** *** 5996,6017 ShowAllGUCConfig(DestReceiver *dest) for (i = 0; i num_guc_variables; i++) { struct config_generic *conf = guc_variables[i]; if ((conf-flags GUC_NO_SHOW_ALL) || ((conf-flags GUC_SUPERUSER_ONLY) !am_superuser)) continue; /* assign to the values array */ ! values[0] = (char *) conf-name; ! values[1] = _ShowOption(conf, true); ! values[2] = (char *) conf-short_desc; /* send it to dest */ ! do_tup_output(tstate, values); /* clean up */ ! if (values[1] != NULL) ! pfree(values[1]); }
Re: [HACKERS] How to embed postgresql?
If you are working with Java, it might be better for you to check out the H2 database; we also have a Java application and until recently we worked with Postgres, but a few months ago we replaced it with H2. Regards Dror Date: Tue, 16 Jun 2009 12:13:38 +0800 Subject: [HACKERS] How to embed postgresql? From: sua...@gmail.com To: pgsql-hackers@postgresql.org Dears, How to embed postgresql if possible? In our Java application solution, we need an embedded database. Berkeley DB can not meet our reqirements in performance and license, so we wish to port postgresql as embedded database. Architecture: Java - JNI - the API is wrapped from Plan/Executor modules What's the feasibility? If it's possible, look for your coaching! Many thanks! Bruce