Re: [HACKERS] Block-level CRC checks
OK, I have a stupid question- torn pages are a problem, but only during recovery. Recovery is (I assume) a fairly rare condition- if data corruption is going to happen, it's most likely to happen during normal operation. So why not just turn off CRC checksumming during recovery, or at least treat it as a much less critical error? During recovery, if the CRC checksum matches, we can assume the page is good- not only not corrupt, but not torn either. If the CRC checksum doesn't match, we don't panic, but maybe we do more careful analysis of the page to make sure that only the hint bits are wrong. Or maybe not. It's only during normal operation that a CRC checksum failure would be considered critical.

Feel free to explain to me why I'm an idiot.

Brian
Re: [HACKERS] Block-level CRC checks
Jonah H. Harris wrote:

On Thu, Oct 2, 2008 at 9:07 AM, Brian Hurt <[EMAIL PROTECTED]> wrote: I have a stupid question wrt hint bits and CRC checksums- it seems to me that if you change the hint bits, it should be possible to very easily calculate what the change in the CRC checksum should be.

Doesn't the problem still remain?

The problem being that the buffer can be changed as it's written, yes? Another possibility is to just not checksum the hint bits...

Brian
Re: [HACKERS] Block-level CRC checks
I have a stupid question wrt hint bits and CRC checksums- it seems to me that if you change the hint bits, it should be possible to very easily calculate what the change in the CRC checksum should be.

The basic idea of the CRC checksum is that, given a message x, the checksum is x mod p, where p is some constant polynomial (all operations are in GF(2^n)). Now, the interesting thing about modulo is that it distributes- that is to say, (x ^ y) mod p = (x mod p) ^ (y mod p), and (x * y) mod p = ((x mod p) * (y mod p)) mod p (I'm using ^ instead of the more traditional + here to emphasize that it's xor, not addition). So let's assume we're updating a word a known n bytes from the end of the message- we calculate y = old_value ^ new_value, so our change is the equivalent of changing the original block m to (m ^ (y * x^{8n})). The new checksum is then (m ^ (y * x^{8n})) mod p = (m mod p) ^ (((y mod p) * (x^{8n} mod p)) mod p). Now, m mod p is the original checksum, (x^{8n} mod p) is a constant for a given n, and the multiplication modulo p can be implemented as a set of table lookups, one per byte.

The take-away here is that, if we know ahead of time where the modifications are going to be, we can make updating the CRC checksum (relatively) cheap. So, for example, a change of the hint bits would only need 4 table lookups and a couple of xors to update the block's CRC checksum.

We could extend this idea- break the 8K page up into, say, 32 256-byte "subblocks". Changing any given subblock would require only re-checksumming that subblock and then updating the CRC checksum. The reason for the subblocks would be to limit the number of tables necessary- each subblock requires its own set of 4 256-word tables, so having 32 subblocks means that the tables involved would be 32*4*256*4 = 128K in size. Going to, say, 64-byte subblocks means needing 128 sets of tables, or 512K of tables.

If people are interested, I could bang out the code tonight, and post it to the list tomorrow.

Brian
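As a concrete (if simplistic) illustration of the linearity involved, something like the sketch below demonstrates it- assuming a plain, non-reflected CRC-32 with a zero initial value and no final xor, which is what makes crc(m ^ d) = crc(m) ^ crc(d) hold. The names and the specific CRC variant are just for illustration, and it still walks the whole delta buffer, so it only shows the algebra- the table-lookup scheme described above is what would make the update genuinely cheap.

/*
 * Sketch: with a zero-init, no-final-xor CRC the checksum is linear over
 * GF(2), so the new page checksum equals the old checksum xored with the
 * checksum of a buffer that is zero everywhere except the changed bytes.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define POLY 0x04C11DB7u        /* the usual CRC-32 polynomial, non-reflected */

static uint32_t crc32_plain(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0;           /* zero init, no final xor: crc is linear */
    for (size_t i = 0; i < len; i++)
    {
        crc ^= (uint32_t) buf[i] << 24;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ POLY : (crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t page[256], delta[256];

    memset(page, 0xAB, sizeof(page));
    memset(delta, 0, sizeof(delta));

    uint32_t old_crc = crc32_plain(page, sizeof(page));

    /* flip a "hint bit" at a known offset, recording the xor in delta */
    size_t  off = 100;
    uint8_t newbyte = page[off] ^ 0x10;
    delta[off] = page[off] ^ newbyte;
    page[off] = newbyte;

    /* full recompute vs. incremental update via the delta buffer */
    uint32_t full = crc32_plain(page, sizeof(page));
    uint32_t incr = old_crc ^ crc32_plain(delta, sizeof(delta));

    assert(full == incr);
    printf("full=%08x incremental=%08x\n", (unsigned) full, (unsigned) incr);
    return 0;
}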
Re: [HACKERS] Block-level CRC checks
Brian Hurt wrote:

Paul Schlie wrote: ... if that doesn't fix the problem, assume a single bit error, and iteratively flip single bits until the check sum matches ...

This can actually be done much faster, if you're doing a CRC checksum (aka modulo over GF(2^n)). Basically, an error flipping bit n will always create the same xor between the computed CRC and the stored CRC. So you can just store a table- for all n, an error in bit n will create an xor of this value- sort the table in order of xor values, and then you can do a binary search on the table, and get exactly which bit was wrong. This is actually probably fairly safe- for an 8K page, there are only 65536 possible bit positions. Assuming a 32-bit CRC, that means that larger corruptions are much more likely to hit one of the other 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact.

Actually, I think I'm going to take this back. Thinking about it, the table is going to be large-ish (~512K) and it assumes a fixed 8K page size. I think a better solution would be a tight loop, something like:

    r = 1u;
    for (i = 0; i < max_bits_per_page; ++i)
    {
        if (r == xor_difference)
        {
            return i;               /* bit i is the single bit that flipped */
        }
        else if ((r & 1u) == 1u)
        {
            r = (r >> 1) ^ CRC_POLY;
        }
        else
        {
            r >>= 1;
        }
    }
    /* fall through: no single-bit error matches this xor difference */

Brian
Re: [HACKERS] Block-level CRC checks
Paul Schlie wrote: ... if that doesn't fix the problem, assume a single bit error, and iteratively flip single bits until the check sum matches ...

This can actually be done much faster, if you're doing a CRC checksum (aka modulo over GF(2^n)). Basically, an error flipping bit n will always create the same xor between the computed CRC and the stored CRC. So you can just store a table- for all n, an error in bit n will create an xor of this value- sort the table in order of xor values, and then you can do a binary search on the table, and get exactly which bit was wrong. This is actually probably fairly safe- for an 8K page, there are only 65536 possible bit positions. Assuming a 32-bit CRC, that means that larger corruptions are much more likely to hit one of the other 4,294,901,760 (2^32 - 2^16) CRC values- 99.998% likely, in fact.

Brian

(hopefully not making the problem worse as may be the case if many bits were actually already in error) and write the data back, and proceed as normal, possibly logging the action; otherwise presume the data is unrecoverable and in error, somehow mark it as being so such that subsequent queries which may utilize any portion of it know it may be corrupt (which I suspect may be best done not on file-system blocks, but actually on logical rows or even individual entries if very large, as my best initial guess, and likely to measurably affect performance when enabled, and haven't a clue how the resulting query should/could be identified as being potentially corrupt without confusing the client which requested it).
Re: [HACKERS] Core team statement on replication in PostgreSQL
Andrew Sullivan wrote:

On Thu, May 29, 2008 at 12:11:21PM -0400, Brian Hurt wrote: Being able to do read-only queries makes this feature more valuable in more situations, but I disagree that it's a deal-breaker.

Your managers are apparently more enlightened than some. ;-) A

No doubt. But defining the minimum acceptable feature set by the demands of the dumbest manager is a no-win proposition.

Brian
Re: [HACKERS] Core team statement on replication in PostgreSQL
David Fetter wrote:

This part is a deal-killer. It's a giant up-hill slog to sell warm standby to those in charge of making resources available because the warm standby machine consumes SA time, bandwidth, power, rack space, etc., but provides no tangible benefit, and this feature would have exactly the same problem. IMHO, without the ability to do read-only queries on slaves, it's not worth doing this feature at all.

I don't think I agree with this. There are a large number of situations where it's positive expectancy to do precisely this- it's not unlike buying a $1 lottery ticket with a 1 chance in 100 of winning $1000- the vast majority of the time (99 times out of 100), you're going to lose $1. But when you win, you win big, and make up for all the small losses you incurred getting there and then some.

Failover machines are like that- most of the time they're negative value, as you said- taking up SA time, bandwidth, power, rack space, money, etc. But every once in a (great) while, they save you. If the cost of having the database down for hours or days (as you madly try to next-day replacement hardware) isn't that great, then no, this isn't worthwhile- but in cases where the database being down chalks up the lost money quickly, this is easy to cost-justify.

Being able to do read-only queries makes this feature more valuable in more situations, but I disagree that it's a deal-breaker.

Brian
Re: [HACKERS] An idea for parallelizing COPY within one backend
Andrew Dunstan wrote:

Florian G. Pflug wrote:

Would it be possible to determine when the copy is starting that this case holds, and not use the parallel parsing idea in those cases?

In theory, yes. In practice, I don't want to be the one who has to answer to an angry user who just suffered a major drop in COPY performance after adding an ENUM column to his table.

I am yet to be convinced that this is even theoretically a good path to follow. Any sufficiently large table could probably be partitioned and then we could use the parallelism that is being discussed for pg_restore without any modification to the backend at all. Similar tricks could be played by an external bulk loader for third party data sources.

I was just floating this as an idea- I don't know enough about the backend to know if it was a good idea or not. It sounds like "not".

Brian
Re: [HACKERS] An idea for parallelizing COPY within one backend
Tom Lane wrote:

"Florian G. Pflug" <[EMAIL PROTECTED]> writes: ... Neither the "dealer", nor the "workers" would need access to either the shared memory or the disk, thereby not messing with the "one backend is one transaction is one session" dogma. ...

Unfortunately, this idea has far too narrow a view of what a datatype input function might do. Just for starters, consider "enum" input, which certainly requires catalog access. We have also explicitly acknowledged the idea that datatype I/O functions might try to store typmod-related data in some special catalog somewhere.

regards, tom lane

Would it be possible to determine when the copy is starting that this case holds, and not use the parallel parsing idea in those cases? I'm a big user of copy, generally into very simple tables- few indexes, simple column types (numeric, varchar, and int almost exclusively), no fancy features. A parallel copy input in the "simple" cases would be of great advantage to me, even if it doesn't parallelize "complicated" cases.

Brian
Re: [HACKERS] Slow count(*)
Kevin Grittner wrote:

If you really are doing proper maintenance, and you don't need exact counts, you might be able to use the approximation stored in the system tables:

Also, if you're using count(*) as an existence test (common in MySQL code), it's better to use exists instead. Using a table in my system, I see:

proddb=> explain analyze select count(*) from instrument_listings where update_date is null and delist_date is null;
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=18385.45..18385.46 rows=1 width=0) (actual time=897.799..897.801 rows=1 loops=1)
   ->  Seq Scan on instrument_listings  (cost=0.00..17973.43 rows=164807 width=0) (actual time=0.018..634.197 rows=146122 loops=1)
         Filter: ((update_date IS NULL) AND (delist_date IS NULL))
 Total runtime: 897.846 ms
(4 rows)

Time: 898.478 ms

proddb=> explain analyze select true where exists(select 1 from instrument_listings where update_date is null and delist_date is null limit 1);
                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.11..0.12 rows=1 width=0) (actual time=0.026..0.028 rows=1 loops=1)
   One-Time Filter: $0
   InitPlan
     ->  Limit  (cost=0.00..0.11 rows=1 width=0) (actual time=0.022..0.022 rows=1 loops=1)
           ->  Seq Scan on instrument_listings  (cost=0.00..17973.43 rows=164807 width=0) (actual time=0.017..0.017 rows=1 loops=1)
                 Filter: ((update_date IS NULL) AND (delist_date IS NULL))
 Total runtime: 0.063 ms
(7 rows)

Time: 0.768 ms
proddb=>

The exists version is over 1000x faster (and the only reason it's not more studly is that I'm working on the table as we speak, so it's all in memory). As a general rule in postgres, don't do count(*) unless you really mean it.

Brian
Re: [HACKERS] Sorting Improvements for 8.4
Brian Hurt wrote:

While we're blue skying things, I've had an idea for a sorting algorithm kicking around for a couple of years that might be interesting. It's a variation on heapsort to make it significantly more block-friendly. I have no idea if the idea would work, or how well it'd work, but it might be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap order- basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based array version here). The problem is that, assuming that the length of a is larger than memory, then a[2i+1] is likely going to be on a different page or block than a[i]. That means every time you have to bubble down a new element, you end up reading O(log N) blocks- this is *per element*.

The variation is to instead work with blocks, so you have a block of entries b[i], and you change the definition of heap order, so that min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during bubble down, you need to be careful to only change the minimum value of one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the algorithm works as normal. The advantage of doing it this way is that while each bubble down still takes O(log N) blocks being touched, you get an entire block worth of results for your effort. Make your blocks large enough (say, 1/4 the size of workmem) and you greatly reduce N, the number of blocks you have to deal with, and get much better I/O (when you're reading, you're reading megabytes at a shot).

Now, there are boatloads of complexities I'm glossing over here. This is more of a sketch of the idea. But it's something to consider.

Following up to myself (my apologies), but it's occurred to me that there are three advantages to this proposal that I've since thought of:

1) The two child blocks b[2i+1] and b[2i+2]- the one with the larger minimum element is the one we might replace. In other words, if min(b[2i+1]) > min(b[2i+2]) and min(b[i]) < min(b[2i+1]), then we know we're going to want the blocks b[4i+3] and b[4i+4] before we're done with blocks b[2i+1] and b[2i+2]. The point here is that this would work wonders with the posix_fadvise/asyncio ideas kicking around. It'd be easy for the code to keep 2 large writes and 2 large reads going pretty constantly.

2) There is some easy parallelization available. I'm not sure how much this is worth, but the bubble-down code is fairly easy to parallelize. If we have two bubble-downs going on in parallel, once they go down different branches (one thread goes to block b[2i+1] while the other goes to b[2i+2]) they no longer interact. Blocks near the root of the heap would be contended over, and multiple threads means smaller blocks to keep the total memory footprint the same. Personally, I think the asyncio idea above is more likely to be worthwhile.

3) It's possible to perform the sort lazily. You have the initial O(N) pass over the list, but then each block is only O(log N) cost. If it's likely that only the first part of the result is needed, then much of the work can be avoided.

Brian
Re: [HACKERS] Sorting Improvements for 8.4
While we're blue skying things, I've had an idea for a sorting algorithm kicking around for a couple of years that might be interesting. It's a variation on heapsort to make it significantly more block-friendly. I have no idea if the idea would work, or how well it'd work, but it might be worthwhile kicking around.

Now, the core idea of heapsort is that the array is put into heap order- basically, that a[i] >= a[2i+1] and a[i] >= a[2i+2] (doing the 0-based array version here). The problem is that, assuming that the length of a is larger than memory, then a[2i+1] is likely going to be on a different page or block than a[i]. That means every time you have to bubble down a new element, you end up reading O(log N) blocks- this is *per element*.

The variation is to instead work with blocks, so you have a block of entries b[i], and you change the definition of heap order, so that min(b[i]) >= max(b[2i+1]) and min(b[i]) >= max(b[2i+2]). Also, during bubble down, you need to be careful to only change the minimum value of one of the two child blocks b[2i+1] and b[2i+2]. Other than that, the algorithm works as normal. The advantage of doing it this way is that while each bubble down still takes O(log N) blocks being touched, you get an entire block worth of results for your effort. Make your blocks large enough (say, 1/4 the size of workmem) and you greatly reduce N, the number of blocks you have to deal with, and get much better I/O (when you're reading, you're reading megabytes at a shot).

Now, there are boatloads of complexities I'm glossing over here. This is more of a sketch of the idea. But it's something to consider.

Brian
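For reference, the access pattern being complained about is the standard scalar sift-down, sketched below as a textbook C version (not PostgreSQL's tuplesort code- just an illustration). Each level of the descent touches a[2i+1] and a[2i+2], which land roughly twice as far into the array, so with an array much larger than memory every level is likely a separate block read; the block variant above amortizes that O(log N) cost over a whole block of output.

#include <stdio.h>

/* Standard max-heap sift-down: each iteration descends one level, and each
 * level's child accesses are ~2x further into the array (a new block when
 * the array doesn't fit in memory). */
static void sift_down(int *a, size_t n, size_t i)
{
    for (;;)
    {
        size_t left = 2 * i + 1;        /* likely a different block than a[i] */
        size_t right = 2 * i + 2;
        size_t largest = i;

        if (left < n && a[left] > a[largest])
            largest = left;
        if (right < n && a[right] > a[largest])
            largest = right;
        if (largest == i)
            return;                     /* heap order restored */

        int tmp = a[i];
        a[i] = a[largest];
        a[largest] = tmp;
        i = largest;                    /* descend one level: another block */
    }
}

int main(void)
{
    int a[] = { 42, 7, 19, 3, 88, 1, 64, 23 };
    size_t n = sizeof(a) / sizeof(a[0]);

    /* heapify, then repeatedly move the max to the end */
    for (size_t i = n / 2; i-- > 0; )
        sift_down(a, n, i);
    for (size_t end = n - 1; end > 0; end--)
    {
        int tmp = a[0]; a[0] = a[end]; a[end] = tmp;
        sift_down(a, end, 0);
    }

    for (size_t i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}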
Re: [HACKERS] Feature Freeze date for 8.4
Josh Berkus wrote:

Folks, You are way ahead of us here. And my vote *still* goes to Mercurial, if we're picking SCMs. Will a new SCM actually make this easier, or are people just using it as an excuse?

We use mercurial here at work, having switched to it recently, and while I don't claim to be an expert, it does seem nice. For example, you can have a local repository you're checking code into, and can pull from and merge up with some shared repository. Also, you can pull from one repository and check into another- so, for example, we have a staging repository and a compiles repository- you pull from the compiles repository (unless you welcome the pain), but push changes back to the staging repository. Then we have a script that pulls recent changes from the staging repository and makes sure they compile and the unit tests run, before moving them over to the compiles repository. This way, the version you're pulling at least compiles and passes some minimal unit tests.

A similar process could work for postgres- except instead of "staging" and "compiles" you'd have "submitted" and "accepted" repositories. And instead of a compile daemon, it'd be reviewers who would move code from one to the other.

Note that everything I'm talking about here is not unique to Mercurial- you can do this just about as easily in darcs or git (I'd advise against Bazaar/bzr)- so don't take this as being pro-Mercurial, just pro-SCM.

Brian
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote:

How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? To avoid this requires a very large hash key (128 bits, minimum)- otherwise you get into birthday attack problems. With a 32-bit hash, the likelihood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelihood is greater than 50% that two strings in a collection of 10 billion will hash to the same value. 10 billion is a large number, but not an unreasonable number, of strings to want to put into a hash table- and it's exactly this case where the O(1) cost of hashtables starts being a real win. Brian

Yes, there is a non-negligible chance of collision (In a DB is there any chance that is non-negligible? :) ) and the values must be checked against the actual. The win is the collapse of the index size and only needing to check a small fraction of the actual tuples.

Ah, OK- I misunderstood you. I thought you were saying that the hash values would need to be unique, and you wouldn't check the original values at all. My bad.

Brian
Re: [HACKERS] Hash index todo list item
Kenneth Marshall wrote:

I understand that a hash value is a many-to-one mapping. That is the point of the flag in the index. The flag means that there is only one item in the heap corresponding to that hash value. In this case we know that the value in the heap is the correct one and a possibly very expensive string comparison can be skipped. Given that the hash function is doing its job, almost every string comparison can be skipped. How long would it take to compare 1-32K of data? How much CPU usage? With this field in place, you only need to check tuple visibility.

How likely is it that you will get a hash collision, two strings that are different that will hash to the same value? To avoid this requires a very large hash key (128 bits, minimum)- otherwise you get into birthday attack problems. With a 32-bit hash, the likelihood is greater than 50% that two strings in a collection of 100,000 will hash to the same value. With a 64-bit hash, the likelihood is greater than 50% that two strings in a collection of 10 billion will hash to the same value. 10 billion is a large number, but not an unreasonable number, of strings to want to put into a hash table- and it's exactly this case where the O(1) cost of hashtables starts being a real win.

Brian
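Those numbers come from the usual birthday approximation, p(collision) ~= 1 - exp(-n(n-1)/(2m)), for n keys hashed uniformly into m possible values. A quick back-of-the-envelope check, as a standalone C snippet (just my arithmetic, nothing hash-index specific):

#include <stdio.h>
#include <math.h>

/* birthday bound: probability of at least one collision among n keys
 * drawn uniformly from m possible hash values */
static double collision_prob(double n, double m)
{
    return 1.0 - exp(-n * (n - 1.0) / (2.0 * m));
}

int main(void)
{
    /* 32-bit hash, 100,000 keys: roughly 0.69, i.e. well over 50% */
    printf("32-bit, 1e5 keys:  %.3f\n", collision_prob(1e5, pow(2, 32)));

    /* 64-bit hash, 10 billion keys: roughly 0.93 */
    printf("64-bit, 1e10 keys: %.3f\n", collision_prob(1e10, pow(2, 64)));
    return 0;
}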
Re: [HACKERS] Should pointers to PGPROC be volatile-qualified?
Tom Lane wrote:

Comments? Does anyone think the C standard forbids what I'm worried about?

My understanding of the C spec is that it explicitly *allows* for exactly what you're afraid of. It's even possible if the uses include function calls, as the compiler might inline the function calls. The downside of littering the code with volatile qualifications is that it's an optimization stopper. For example, if proc is declared volatile, the compiler couldn't merge multiple different proc->foo references into a single load into a register.

Note that all sorts of weirdnesses are possible when you have shared mutable state between multiple different threads. For example, assume you have two threads, and two global ints x and y, initially both 0. Thread 1 does:

    y = 1;
    r1 = x;

(where r1 is a local variable in thread 1), while thread 2 does:

    x = 1;
    r2 = y;

(with r2 being a local variable in thread 2). Here's the thing: both r1 and r2 can end up 0! I've seen this in real code. What happens is that the compiler notices that in both cases, the loads and stores are independent, so it can reorder them. And as loads tend to be expensive, and nothing can progress until the load completes, it moves the loads up before the stores, assuming the program won't notice. Unfortunately, it does, as "the impossible" can then happen.

Brian
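A self-contained version of that example, using pthreads, looks something like the following. Whether the r1 == r2 == 0 outcome actually shows up depends on the compiler, optimization level, and hardware (and strictly speaking the unsynchronized accesses are a data race), so treat it only as a sketch of what plain C permits, not a reliable reproducer.

#include <pthread.h>
#include <stdio.h>

static int x = 0, y = 0;
static int r1, r2;

static void *thread1(void *arg)
{
    (void) arg;
    y = 1;          /* store to y ... */
    r1 = x;         /* ... may be reordered after this load of x */
    return NULL;
}

static void *thread2(void *arg)
{
    (void) arg;
    x = 1;
    r2 = y;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* "Impossible" outcome under naive sequential reasoning: */
    if (r1 == 0 && r2 == 0)
        printf("both loads saw 0: the stores were reordered past the loads\n");
    else
        printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}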
Re: [HACKERS] SQL feature requests
Chuck McDevitt wrote:

Tom, it isn't just a case of "convenience". When we are trying to convert users from another database (say Oracle for example) to PostgreSQL, one of the big stumbling blocks that slows down the work is all the little query changes that people have to make (people who might not have written the query as well), and it can require the review of hundreds or thousands of SQL scripts and applications. The harder it is, the more reluctant they are to convert. Sometimes supporting "de-facto" standards as well as official standards makes sense.

One of the ways "de-facto" standards are even better than standards is that there are even more of them. Which de-facto standard should we follow: Oracle, MySQL, or MS SQL Server? Note that following all of them is a bad idea, as one person's convenience is another person's stumbling block.

Brian
[HACKERS] Possible feature request
I'm not sure if this is the right venue for this- if it isn't, I apologize. But a feature that I'd like to see is the ability to put an 'as' alias on the target of an update statement. I find myself often writing statements like:

UPDATE some_really_long_table_name
SET col1 = some_value,
    col2 = some_other_value
FROM some_other_really_long_table_name AS table2
WHERE some_really_long_table_name.col3 = table2.col3
  AND some_really_long_table_name.col4 = table2.col4
;

What I'd like to write is:

UPDATE some_really_long_table_name AS table1
SET col1 = some_value,
    col2 = some_other_value
FROM some_other_really_long_table_name AS table2
WHERE table1.col3 = table2.col3
  AND table1.col4 = table2.col4
;

Is this possible? Not for 8.3, obviously, but maybe for 8.4?

Brian
Re: [HACKERS] GUC time unit spelling a bit inconsistent
Peter Eisentraut wrote:

On Thursday, 21 June 2007 00:38, Gregory Stark wrote: I think people are worried that an 'm' in one column might mean something different than an 'm' in another column, and perhaps that is confusing.

To whom? The person writing it? If everyone around here had gotten their way we'd already be in a situation where you could write

log_rotation_age = 5m
log_rotation_size = 5m

Would it be a bad idea to always support the multi-character abbreviations? The min, sec, hr, day, wk, mth, yr are all, to my knowledge, unambiguous. As are kb, mb, gb, tb. So the above would be:

log_rotation_age = 5min
log_rotation_size = 5mb
edinburgh_distance = 5mi

Brian
Re: [HACKERS] Selecting a constant question
Larry McGhaw wrote:

I'm really frustrated by this process. I'm not trying to attack anyone here. I'm just surprised that no one will even entertain the idea that this is an issue that needs to be addressed. Instead nearly all of the responses have been attacking the applications that rely on the metadata.

Having been following this debate, I think what people have really been attacking is the idea that the metadata for:

SELECT '1' AS varchar_column;

should be different from the metadata for:

SELECT varchar_column FROM really_big_table;

or for:

SELECT varchar_column FROM really_small_table;

Or at least that's what I've taken away from the discussion- it's not so much that the metadata shouldn't be relied on, it's that the metadata may be more generic than theoretically necessary. And that the metadata may not contain the length of a variable length field even when that length is known.

Brian
Re: [HACKERS] Ye olde drop-the-database-you-just-left problem
Tom Lane wrote:

I just finished giving someone the standard advice to wait a bit before trying to drop a database that'd just been accessed: http://archives.postgresql.org/pgsql-general/2007-05/msg01505.php AFAICT a "real" fix for this would involve making PQfinish() synchronous (don't return till backend is dead), which doesn't seem like a great idea. However, it suddenly struck me that we could probably make most of the problem go away if we put that same wait into DROP DATABASE itself --- that is, if we see other backends in the target DB, sleep for a second or two and then recheck before erroring out. This isn't bulletproof since under high load the other backend might not get to quit, but it'd surely reduce the frequency of complaints a great deal. And we could take out the ad-hoc sleeps that are done in (eg) the contrib regression tests. Thoughts?

Is this a synchronization issue? I'm wondering if there isn't a better solution. The problem with waiting is that a) you're going to be waiting a lot when it's not necessary, and b) there's a real likelihood that you won't wait long enough (especially under load, as you mentioned).

I'm wondering if something like this would work. When a backend connects to the database, it increments a semaphore associated with that database. The last thing it does when exiting is release the semaphore- which is the backend's way of saying "OK, all done here". The drop database command checks the semaphore- if it still has a non-zero count, it fails rather than dropping the database. A possibly optional argument would have it wait until the semaphore is 0, and then drop the database. This has the advantage of only waiting long enough.

No idea how practical this would be, tho...

Brian
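The shape of the idea, as a toy sketch- a mutex-protected counter rather than PostgreSQL's actual shared-memory machinery, with every name and structure here invented purely for illustration:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* One of these per database: backends bump the count on connect and drop it
 * as the very last thing they do on exit; DROP DATABASE either fails while
 * the count is nonzero or, optionally, waits for it to reach zero. */
typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t  zero;
    int             nbackends;
} db_refcount;

static void backend_connect(db_refcount *db)
{
    pthread_mutex_lock(&db->lock);
    db->nbackends++;
    pthread_mutex_unlock(&db->lock);
}

static void backend_exit(db_refcount *db)      /* last thing a backend does */
{
    pthread_mutex_lock(&db->lock);
    db->nbackends--;
    if (db->nbackends == 0)
        pthread_cond_broadcast(&db->zero);
    pthread_mutex_unlock(&db->lock);
}

/* Returns true if the database may be dropped. */
static bool drop_database_check(db_refcount *db, bool wait)
{
    bool ok;

    pthread_mutex_lock(&db->lock);
    if (wait)
        while (db->nbackends > 0)
            pthread_cond_wait(&db->zero, &db->lock);
    ok = (db->nbackends == 0);
    pthread_mutex_unlock(&db->lock);
    return ok;
}

int main(void)
{
    db_refcount db = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

    backend_connect(&db);
    if (!drop_database_check(&db, false))
        puts("database is in use, not dropping");

    backend_exit(&db);
    if (drop_database_check(&db, false))
        puts("dropped");
    return 0;
}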
Re: [HACKERS] Last minute mini-proposal (I know, I know) for PQexecf()
My apologies for the late reply...

Tom Lane wrote:

<[EMAIL PROTECTED]> writes: I'd like to see a new variant on PQexec(): PGresult * PQexecf(PGconn *conn, const char *fmt, ...);

Way too late for 8.3 --- if we were going to do something like this, we should think first and program later. In particular, blindly adopting the sprintf format string definition doesn't seem very helpful. The sorts of escapes I'd want to have are "properly quoted SQL identifier", "properly quoted SQL literal", etc. A large fraction of what sprintf knows about is more or less irrelevant to the task of creating SQL commands.

The advantage of using stock sprintf commands is that most compilers understand them these days, and can check that the arguments given match the format string. If you go with your own format specifiers, this is no longer true.

Brian
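For example, GCC and Clang will type-check printf-style format strings against the actual arguments when a function is declared with the format attribute- something a custom set of escapes wouldn't get for free. A hypothetical sketch (the wrapper and its behavior are invented just for illustration):

#include <stdio.h>
#include <stdarg.h>

/* The attribute says "check arguments 2... against the printf-style format
 * string in argument 1" at compile time. */
static void run_query(const char *fmt, ...) __attribute__((format(printf, 1, 2)));

static void run_query(const char *fmt, ...)
{
    char    buf[1024];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    puts(buf);                  /* stand-in for actually executing the query */
}

int main(void)
{
    run_query("SELECT * FROM t WHERE id = %d", 42);      /* fine */
    /* run_query("SELECT * FROM t WHERE id = %d", "42");    warns at compile time */
    return 0;
}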
Re: [HACKERS] Money type todos?
Shane Ambler wrote:

August Zajonc wrote: Agreed with Tom on this one. Full usage of money is beyond tagged types etc. For example, when you earn money in another currency, it is the time at which you earn it that describes its value. So for P&L accounts there is generally no change in exchange rates over time and you need to track what the rate was at time of earning. Solution is to date earnings and have a table of exchange rates by day.

Personally I think a true money type should hold the numeric value and optionally the currency (similar to the timestamp with timezone) and have support functions that handle the i/o conversion (text - $US1,000.00 - to money) as happens now. As opposed to the db designer storing it in different columns.

It'd be nice if there were an easy, standard solution to this problem- but I don't think there is. For example, our application (which would be greatly simplified if there was a standard solution to this) knows of ~200 different currencies, including such standards as the Uganda Shilling, the Zambia Kwacha, and Ethiopian Birr. Not to mention you get situations where governments (for various reasons) issue "new" currency, like the "new" Zimbabwe Dollar versus the "old" Zimbabwe Dollar. Confuse these two and you'll lose your shirt.

Personally, I don't think it's that big of a deal to have to do in my queries:

SELECT table.amount || ccy.code
FROM table
LEFT JOIN lu_currency AS ccy ON table.ccy_id = ccy.id

to make the report come out as "100USD".

Brian
Re: [HACKERS] tsearch in core patch, for inclusion
Markus Schiltknecht wrote:

Hi, I recall having read something about rewriting the parser. Together with Tom being worried about parser performance and knowing GCC has switched to a hand written parser some time ago, I suspected bison to be slow. That's why I've asked.

This has little to do with performance and everything to do with the insanity which is C++: http://gnu.teleglobe.net/software/gcc/gcc-3.4/changes.html

* A hand-written recursive-descent C++ parser has replaced the YACC-derived C++ parser from previous GCC releases. The new parser contains much improved infrastructure needed for better parsing of C++ source codes, handling of extensions, and clean separation (where possible) between proper semantics analysis and parsing. The new parser fixes many bugs that were found in the old parser.

Short form: C++ is basically not LALR(1) parseable.

Brian
Re: [HACKERS] tripping an assert in 8.1.6 (more info)
Tom Lane wrote:

Brian Hurt <[EMAIL PROTECTED]> writes: Execute, on a fresh database, the following sql, to recreate the bug:

Hmm, I thought that stack trace looked a bit familiar --- we seem to have fixed the problem as of 8.2. Unfortunately I can't recall what the relevant change was exactly; time for some digging in the CVS logs.

Any hope of getting the fix back-ported into the 8.1 tree?

Brian
Re: [HACKERS] tripping an assert in 8.1.6 (more info)
I forgot to mention: core dumps available upon request (obviously I don't want to post them to the list).

Brian
Re: [HACKERS] tripping an assert in 8.1.6 (more info)
More info on that assert I've hit. Compile 8.1.6 with configuration options:

    ./configure --with-perl --enable-debug --enable-cassert

(not sure if --with-perl is relevant or not, I think not). This is on Fedora Core 5 on x86-32. Execute, on a fresh database, the following SQL to recreate the bug:

CREATE TABLE foo (
    some_data VARCHAR(32)
    ,row_date DATE
);

CREATE TABLE bar (
    some_data VARCHAR(32)
    ,row_date DATE
);

CREATE TABLE quux (
    some_data VARCHAR(32)
    ,more_data VARCHAR(32)
);

CREATE OR REPLACE VIEW bazz AS
    SELECT ('bar: ' || bar.row_date) :: TEXT AS action,
           quux.more_data AS more_data,
           bar.row_date AS row_date
    FROM bar
    JOIN quux ON bar.some_data = quux.some_data
    UNION ALL
    SELECT ('foo: ' || foo.row_date) :: TEXT AS action,
           quux.more_data AS more_data,
           foo.row_date AS row_date
    FROM foo
    JOIN quux ON foo.some_data = quux.some_data
;

SELECT action, more_data FROM bazz;
[HACKERS] tripping an assert in 8.1.6
Hello all. It seems I'm tripping an assert in 8.1.6- the assert on line 219 of src/backend/executor/execScan.c (found by running gdb on a core dump). This is on x86 and Redhat Linux (forget which version). Note that if I recompile 8.1.6 with asserts turned off the query completes just fine. I'm trying to put together an example which reproduces the problem without requiring half our company's data- that should follow soon. The gdb backtrace is:

#0  0xe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xe410 in __kernel_vsyscall ()
#1  0xb7d2dee9 in raise () from /lib/libc.so.6
#2  0xb7d2f4f1 in abort () from /lib/libc.so.6
#3  0x0824f931 in ExceptionalCondition (conditionName=Variable "conditionName" is not available.) at assert.c:51
#4  0x081537ac in ExecAssignScanProjectionInfo (node=0x8426bec) at execScan.c:219
#5  0x08161339 in ExecInitSubqueryScan (node=0x8412de4, estate=0x8426ad4) at nodeSubqueryscan.c:212
#6  0x0814e0e4 in ExecInitNode (node=0x8412de4, estate=0x8426ad4) at execProcnode.c:179
#7  0x0814c554 in ExecutorStart (queryDesc=0x842554c, explainOnly=1 '\001') at execMain.c:618
#8  0x081193f5 in ExplainOnePlan (queryDesc=0x842554c, stmt=0x839afe4, tstate=0x83cbdac) at explain.c:243
#9  0x081198ac in ExplainOneQuery (query=0x83b88e4, stmt=0x839afe4, tstate=0x83cbdac) at explain.c:214
#10 0x08119a92 in ExplainQuery (stmt=0x839afe4, dest=0x83b8a54) at explain.c:121
#11 0x081da391 in PortalRunUtility (portal=0x83b67b4, query=0x839b07c, dest=0x83b8a54, completionTag=0x0) at pquery.c:987
#12 0x081db6dc in PortalRun (portal=0x83b67b4, count=2147483647, dest=0x839b030, altdest=0x839b030, completionTag=0xbf9efee8 "") at pquery.c:637
#13 0x081d713c in exec_simple_query (query_string=0x839a26c "explain SELECT action, bloomberg_code, composite_bloomberg_code, reuters_code, cusip_code, sedol_code, isin_code FROM vw_ca_generic_actions WHERE (action_date >= '20070122'::date) AND (action_date <= "...) at postgres.c:1004
#14 0x081d8bd3 in PostgresMain (argc=4, argv=0x83593f0, username=0x83593b8 "postgres") at postgres.c:3232
#15 0x081aca37 in ServerLoop () at postmaster.c:2865
#16 0x081ad936 in PostmasterMain (argc=3, argv=0x8358560) at postmaster.c:941
#17 0x0816c1c9 in main (argc=3, argv=Cannot access memory at address 0x1515) at main.c:265

This is mainly a "heads up- bug incoming" message. Thanks.

Brian
Re: [HACKERS] Function execution costs 'n all that
Neil Conway wrote:

On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote: I therefore propose that the engine evaluate - benchmark, if you will - all functions as they are ingested, or vacuum-like at some later date (when valid data for testing may exist), and assign a cost relative to what it already knows - the built-ins, for example.

That seems pretty unworkable. It is unsafe, for one: evaluating a function may have side effects (inside or outside the database), so the DBMS cannot just invoke user-defined functions at whim. Also, the relationship between a function's arguments and its performance will often be highly complex -- it would be very difficult, not to mention computationally infeasible, to reconstruct that relationship automatically, especially without any real knowledge about the function's behavior.

Non-developer here, but we use a lot of plpgsql functions at work. And the functions we use fall into two broad, ill-defined categories- "expensive" functions and "cheap" functions. What I'd like as a user is some way to tell the planner "this function is expensive- prefer plans which call this function less even if they're otherwise more expensive" or "this function is cheap, prefer plans that are otherwise less expensive even if they call this function more often". Precise cost estimates aren't that important, IMHO.

Brian
Re: [HACKERS] [PATCHES] [BUGS] BUG #2846: inconsistent and
Bruce Momjian wrote:

Tom Lane wrote: Bruce Momjian <[EMAIL PROTECTED]> writes: Tom Lane wrote: This is *not* going in the right direction :-( Well, then show me what direction you think is better. Fewer restrictions, not more. The thrust of what I've been saying (and I think Roman too) is to trust in the hardware float-arithmetic implementation to be right. Every time you add an additional "error check" you are going in the wrong direction.

OK, are you saying that there is a signal we are ignoring for overflow/underflow, or that we should just silently overflow/underflow and not throw an error?

My understanding is that you have to actually set flags in the floating point environment to make overflows, underflows, infinities, NaNs, etc. raise signals. You might take a look at fenv.h (defined in the C99 spec) for the functions to do this. My apologies if I'm re-covering well-trod ground here.

Note that taking a signal on an FP exception is a horribly expensive proposition- we're talking about hundreds or thousands of clock cycles here. But it's probably worthwhile versus the cost of testing every floating point result, as generally FP exceptions will be rare (probably even more rare in database work than in general). So it's probably worthwhile.

Brian
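For what it's worth, the C99 interface also lets you poll the sticky exception flags after the fact, without taking a signal at all- clear the flags, do the arithmetic, then test them. A minimal sketch (the volatile qualifiers are just to keep the compiler from folding the arithmetic at compile time; turning the flags into actual signals needs non-standard calls such as glibc's feenableexcept()):

#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double huge = 1e308;
    volatile double tiny = 1e-308;

    feclearexcept(FE_ALL_EXCEPT);

    double a = huge * 10.0;     /* overflows to +inf */
    double b = tiny * tiny;     /* underflows toward 0 */

    if (fetestexcept(FE_OVERFLOW))
        printf("overflow raised (a = %g)\n", a);
    if (fetestexcept(FE_UNDERFLOW))
        printf("underflow raised (b = %g)\n", b);
    return 0;
}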
Re: [HACKERS] invalid input syntax for type timestamp.
Joachim Wieland wrote:

Yes, I completely agree that JDT should not be included. I just wanted to understand how those lines show that JST is still in active use. As far as I understand it, it says that JST was observed from 1948 to 1951 (the second rule) and now there is a time zone "J%sT" (because there is no "until"-date in the last line) but there is no rule that tells us what to replace "%s" with...

We're using JST as a time zone here, so I'd like to politely request that JST stay in the list of time zones, whether it is an "official" time zone or not. Thanks.

Brian
[HACKERS] Compressing table images
My apologies if this subject has already been hashed to death, or if this is the wrong list, but I was wondering if people had seen this paper:

http://www.cwi.nl/htbin/ins1/publications?request=intabstract&key=ZuHeNeBo:ICDE:06

Basically it describes a compression algorithm for tables of a database. The huge advantage of doing this is that it reduced the disk traffic by (approximately) a factor of four- at the cost of more CPU utilization.

Any thoughts or comments?

Brian