Re: [HACKERS] session id and global storage
It seems MyProcID is what I was searching for... David Hoksza DH> Something like this would maybe be possible, but this select can return more rows when the user is connected with more than one instance... DH> David Hoksza DH> Hi, I can't find any function which tells me something like a session id. Is there something like that? I need it in my AM, because I need to know whether something I wrote to a file was written in this current session or previously. AL> How about AL> select procpid||' '||backend_start from pg_stat_activity; AL> Yours, AL> Laurenz Albe ---(end of broadcast)--- TIP 6: explain analyze is your friend
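A sketch of what that query returns for the current session only, assuming the 8.1-era pg_stat_activity column names (procpid, backend_start) and the built-in pg_backend_pid() function:

    SELECT procpid, backend_start
      FROM pg_stat_activity
     WHERE procpid = pg_backend_pid();   -- only this backend's row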
Re: [HACKERS] copy with compression progress n
On Wed, 2006-05-31 at 17:31, Andreas Pflug wrote: Tom Lane wrote: Andreas Pflug [EMAIL PROTECTED] writes: The attached patch implements COPY ... WITH [BINARY] COMPRESSION (compression implies BINARY). The copy data uses bit 17 of the flag field to identify compressed data. I think this is a pretty horrid idea, because it changes pg_lzcompress from an unimportant implementation detail into a backup file format that we have to support till the end of time. What happens if, say, we need to abandon pg_lzcompress because we find out it has patent problems? It *might* be tolerable if we used gzip instead, I used pg_lzcompress because it's present in the backend. I'm fine with every other good compression algorithm. but I really don't see the argument for doing this inside the server at all: piping to gzip seems like a perfectly acceptable solution, As I said, this hits only if it is possible to pipe the result into gzip in a performant way. The issue already arises if psql or any other COPY client (slony, pg_dump) is not on the same machine: network bandwidth will limit throughput. Maybe make up a way to pipe the COPY result through some external process (like gzip) on the server side without having shell access there. To make it secure, the external process should probably be run from a hardwired directory via chroot. -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] session id and global storage
Hannu Krosing said: On Thu, 2006-06-01 at 10:10, David Hoksza wrote: It seems MyProcID is what I was searching for... On a busy server with lots of connects, the procID will repeat quite often. log_line_prefix has a session id gadget: Session ID: A unique identifier for each session. It consists of two 4-byte hexadecimal numbers (without leading zeros) separated by a dot. The numbers are the session start time and the process ID. Assuming PIDs are not reused within one second, that will be unique. cheers andrew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
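A rough SQL reconstruction of that identifier, assuming backend_start in pg_stat_activity corresponds to the session start time used by the log_line_prefix %c escape (to_hex() drops leading zeros, matching the description above):

    SELECT to_hex(extract(epoch FROM backend_start)::integer)
           || '.' || to_hex(procpid) AS session_id
      FROM pg_stat_activity
     WHERE procpid = pg_backend_pid();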
Re: [HACKERS] Possible TODO item: copy to/from pipe
After re-reading what I just wrote to Andreas about how compression of COPY data would be better done outside the backend than inside, it struck me that we are missing a feature that's fairly common in Unix programs. Perhaps COPY ought to have the ability to pipe its output to a shell command, or read input from a shell command. Maybe something like COPY mytable TO '| gzip > /home/tgl/mytable.dump.gz'; (I'm not wedded to the above syntax, it's just an off-the-cuff thought.) Of course psql would need the same capability, since the server-side copy would still be restricted to superusers. You can accomplish COPY piping now through psql, but it's a bit awkward: psql -c "COPY mytable TO stdout" mydb | gzip ... Thoughts? Is this worth doing, or is the psql -c approach good enough? To be honest, I don't see much benefit in it. You can already accomplish what you want to accomplish easily enough. If you want to muck with COPY, I'd like to see it accept a query as: psql -c "COPY select * from mytable where foo='bar' TO stdout" mydb | gzip ... ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] CTID issues and a soc student in need of help
Hi, I am a Google soc student and in need of some help with PostgreSQL internals: My C function can run (and already running) for a very very long time on some inputs and reiterate on relations using SPI. Basically, I open portals and cursors to relations. Also note that I always open the relations in READ ONLY mode using SPI. A big no no is that some data in the relations would change since that would just ruin everything. I have a great need to identify a tuple uniquely so my prototype uses the CTID field for that purpose. Btw, this identification is required by the algorithm and primary keys are just wasteful and will slow the process considerably (not to mention some relations don't have primary keys). The question is, can the CTID field change throughout the run of my function due to some other processes working on the relation? Or because of command boundaries it is pretty much secured inside an implicit transaction? The problem wasn't so great if I didn't want to exploit indices in the relations (but I do and does), since after you issue a SELECT that uses indices, all you can rely on is the CTID to uniquely identify a tuple. The other solution is to temporarily duplicate the relations but this is less appealing (plus it's already working great with CTIDs). -- Tzahi Fadida Blog: http://tzahi.blogsite.org | Home Site: http://tzahi.webhop.info WARNING TO SPAMMERS: see at http://members.lycos.co.uk/my2nis/spamwarning.html ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
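For readers not familiar with it, ctid is the system column that gives a row version's physical location as (block number, line pointer); it can be selected like any other column and used in a WHERE clause to re-fetch exactly that row version. A sketch, reusing a relation name mentioned later in this thread:

    SELECT ctid, * FROM relation1 LIMIT 5;           -- e.g. (0,1), (0,2), ...
    SELECT * FROM relation1 WHERE ctid = '(0,1)';    -- fetch that specific row version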
Re: [HACKERS] CTID issues and a soc student in need of help
On Thu, Jun 01, 2006 at 03:33:50PM +0300, Tzahi Fadida wrote: The question is, can the CTID field change throughout the run of my function due to some other processes working on the relation? Or because of command boundaries it is pretty much secured inside an implicit transaction? The problem wasn't so great if I didn't want to exploit indices in the relations (but I do and does), since after you issue a SELECT that uses indices, all you can rely on is the CTID to uniquely identify a tuple. The CTID is the location on disk of the tuple, so no, it doesn't change while you are running. However, if you're running in isolation mode read committed, then someone else could update the tuple while you're looking at it. In this case the tuple will appear to vanish and the updated version will appear elsewhere with a new CTID. Whether this is a problem or not depends on what you're using it for. Hope this helps, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
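A small illustration of the caveat above, with a hypothetical table t(id, val): under READ COMMITTED, a concurrent UPDATE creates a new row version at a new location, so a saved ctid no longer finds the current row.

    SELECT ctid FROM t WHERE id = 1;          -- say this returns (0,1)
    -- ... another session runs: UPDATE t SET val = val + 1 WHERE id = 1; ...
    SELECT ctid FROM t WHERE id = 1;          -- now returns something else, e.g. (0,7)
    SELECT * FROM t WHERE ctid = '(0,1)';     -- the old version is no longer visible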
Re: [HACKERS] CTID issues and a soc student in need of help
I am using CTID for the concept of a tuple set. For example, the set of t1 from relation1, t1 from relation2, t10 from relation3 will be represented in my function as a list of (TableID:CTID) pairs, for example {(1:1),(2:1),(3:10)}. I then save these in bytea arrays in a tuplestore. This is essential for the full disjunction algorithm because you do not want to recompute joins for every such set using the actual attributes. From the documentation I see that Dirty Read, Nonrepeatable Read and Phantom Read are all unacceptable. Moreover, when I said a long time I meant full disjunctions can run for hours on an extremely difficult input (not that most people will want to do that, but for the general case), so it is not realistic to lock the tables for that period. So I have two follow-up questions: 1) If I execute the function at the serializable isolation level and the function is read-only with respect to the tables, is it possible for the function to fail, or for other updating transactions to either fail or wait for hours/starve? 2) Inside the function I open and reopen cursors and portals to relations; how can I arrange, once, for all of those to be opened at the serializable isolation level? I suspect that because of command boundaries, just running SET TRANSACTION SERIALIZABLE using SPI at the start should be enough. On Thu, 2006-06-01 at 15:30 +0200, Martijn van Oosterhout wrote: The CTID is the location on disk of the tuple, so no, it doesn't change while you are running. However, if you're running in isolation mode read committed, then someone else could update the tuple while you're looking at it. In this case the tuple will appear to vanish and the updated version will appear elsewhere with a new CTID. Whether this is a problem or not depends on what you're using it for. Hope this helps, ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
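A sketch of what question 2 amounts to at the SQL level (whether issuing the SET via SPI inside the function has the same effect is exactly what is being asked; transaction_isolation is a real, readable setting):

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT current_setting('transaction_isolation');   -- 'serializable'
    -- ... call the full-disjunction function here; every snapshot it takes
    -- (cursors, portals, SPI queries) now sees the same data ...
    COMMIT;

As far as question 1 goes, a serializable transaction that only reads simply keeps using its original snapshot; it holds only ordinary AccessShareLocks, so it does not make concurrent INSERT/UPDATE/DELETE activity fail or wait (though DDL such as ALTER TABLE on those relations would have to wait for it).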
Re: [HACKERS] CTID issues and a soc student in need of help
Tzahi Fadida [EMAIL PROTECTED] writes: I am using CTID for the concept of a tuple set. For example, the set of t1 from relation1, t1 from relation2, t10 from relation3 will be represented in my function as a list of (TableID:CTID) pairs. For example {(1:1),(2:1),(3:10)) I then save these in bytea arrays in a tuplestore. This is essential for the full disjunction algorithm because you do not want to recompute joins for every such set using the actual attribute. I think this is OK within the context of a single SQL command, since tuple visibility should not change for that command. If you were trying to use the CTID info across multiple statements then it'd get worrisome. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] CTID issues and a soc student in need of help
I am not sure about the definition of a context of a single SQL command. Example of a run: A - SELECT getfdr('Relation1,Relation2,Relation3'); to get the result schema (takes a few milliseconds). SELECT * FROM FullDisjunctions('Relation1,Relation2,Relation3') AS RECORD A; Can take a long time. Inside the C-language FullDisjunctions() function I repeatedly call, using SPI: SELECT * FROM Relation1; SELECT * FROM Relation2; SELECT * FROM Relation1 WHERE...; SELECT * FROM Relation3; So there is one SQL call to FullDisjunctions and subsequent SQL calls inside it. I want the context of the overall SELECT FullDisjunctions to be perfectly unchanged for all subsequent SQL calls inside it, and especially the CTIDs. p.s.: In a different version of the function I create a temporary relation and insert tuples into it, but it is exclusively used and destroyed by the specific instance of that function. I hope it does not affect anything in the general sense of a read-only transaction. 10x! Regards, tzahi. On Thu, 2006-06-01 at 11:28 -0400, Tom Lane wrote: I think this is OK within the context of a single SQL command, since tuple visibility should not change for that command. If you were trying to use the CTID info across multiple statements then it'd get worrisome. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] CTID issues and a soc student in need of help
Tzahi Fadida [EMAIL PROTECTED] writes: I am not sure about the definition of a context of a single SQL command. Well, AFAICS selecting a disjunction ought to qualify as a single SQL command using a single snapshot. It's not that different from a JOIN or UNION operation, no? Inside C-language FullDisjunctions() function i repeatedly call, using SPI: SELECT * FROM Relation1; SELECT * FROM Relation2; SELECT * FROM Relation1 WHERE...; SELECT * FROM Relation3; You would need to force all these operations to be done with the same snapshot; should be possible with SPI_execute_snapshot. But really the above sounds like a toy prototype implementation to me. Why aren't you building this as executor plan-tree machinery? p.s.: In a different version of the function i create a temporary relation and insert tuples in it, but it is exclusively used and destroyed by the specific instance of that function. Why? You could use a tuplestore for transient data. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] More thoughts about planner's cost estimates
Tom, As you know, this is something I think about a bit too, though not nearly as deeply as you. In general it seems to me that for CPU-bound databases, the default values of the cpu_xxx_cost variables are too low. I am tempted to raise the default value of cpu_index_tuple_cost to 0.005, which would be half of cpu_tuple_cost and twice cpu_operator_cost; that seems more in line with my feel for the relative costs of things. For a CPU-bound database all three would need to go up more than that. But rather than telling people to manipulate all three of these variables individually, I think it might also be a good idea to provide a new GUC variable named something like cpu_speed_scale that would just be a multiplier for the other variables. Yes, please! I've done a bit of playing around with the cpu_* variables, and frankly have always manipulated them by the same multiple. Also, my testing has shown that on *large* databases (20x RAM +) with fast processors (Opteron) the cpu_* values are too high. But a lot of that is offsetting estimates which are too high elsewhere ... as is random_page_cost. I've desisted from touching this mainly because the obvious sanity adjustments, such as adding something for tree descent and charging random_page_cost not 1.0 for leaf page touches, would increase the estimated cost of index usage, which seemed like not the direction we wanted to go in. So I figured that the errors were more or less cancelling each other. But the addition of bitmap index scans is now exposing the weaknesses, so we need to face up to the problem. Yeah. I've refrained from proposing changes because it's a pick-up-sticks. If we start modifying the model, we need to fix *everything*, not just one item. And then educate our users that they need to use the GUC variables in a different way. Here are the issues I see: 1. n-distinct estimation is bad, as previously discussed; 2. our current heuristic sampling methods prevent us from sampling more than 0.5% of any reasonably large table, causing all statistics on those tables to be off for any table with irregular distribution of values; 3. We don't have any method to analyze inter-column correlation within a table; 4. We don't keep statistics on foreign key correlation; 5. random_page_cost (as previously discussed) is actually a function of relatively immutable hardware statistics, and as such should not need to exist as a GUC once the cost model is fixed. 6. We haven't added any way to estimate rows returned from SRFs. I am inclined to think that a reasonable model is to continue to estimate the number of index leaf pages touched as above (pro-rating against the total number of index entries), to charge random_page_cost per leaf page touched, and not to count anything for metapage or tree descent. I justify the latter on the grounds that upper tree levels will tend to stay in cache because they're touched so often. random_page_cost per leaf page may be an overestimate, since sometimes logically adjacent leaf pages will be physically adjacent too, but not charging for tree descent should help to cancel that out. With this model, the disk cost to fetch a single index entry will be estimated as random_page_cost (default 4.0) rather than the current fixed 2.0. This shouldn't hurt things too much for simple indexscans --- especially since anyone running with a reduced random_page_cost won't see as much change. And it will increase the costs for bitmap scans that cross many index pages, which is what we need in light of Philippe's numbers. 
Per above, I think anything involving random_page_cost is the wrong way to go. RPC is a GUC assigned by the DBA on pure guesswork. We should be decreasing its use or eliminating it entirely, not increasing the amount we use it. For btrees, we should be able to accurately estimate the cost of the index access given: a) the depth of the tree; b) the likelihood that requested values are adjacent; c) whether the index and tables are already in shared_mem, or else (d) and (e) below: d) the probability that the index pages are cached in memory, which is composed of: (1) the frequency with which that index is accessed, modified by (2) whether the index is a fraction of available ram, or larger than ram e) the probability that the relevant table pages are cached in memory, determined by the same two factors. *None* of this should involve a userset parameter (random_page_cost) since as you pointed out months ago, the ratio of seq/seek access has remained relatively constant over the past 6 years of HDD and I/O engineering. Sadly, I don't have the math for all of the above but I'm hoping someone (either here or on ACM) does. Also, this would require us to have different *formulas* and not merely different cost factors for different kinds of indexes. I believe that this is something that Jie is already struggling with. Jie? The big
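To make the "manipulated them by the same multiple" point above concrete, this is roughly what a cpu_speed_scale knob would automate today (the values shown are simply the current defaults doubled, purely for illustration):

    SET cpu_tuple_cost = 0.02;          -- default 0.01
    SET cpu_index_tuple_cost = 0.002;   -- default 0.001
    SET cpu_operator_cost = 0.005;      -- default 0.0025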
Re: [HACKERS] CTID issues and a soc student in need of help
On Thu, 2006-06-01 at 12:45 -0400, Tom Lane wrote: Tzahi Fadida [EMAIL PROTECTED] writes: I am not sure about the definition of a context of a single SQL command. Well, AFAICS selecting a disjunction ought to qualify as a single SQL command using a single snapshot. It's not that different from a JOIN or UNION operation, no? Yes, it is (at least the current version i am implementing) a one shot computation. It is computed top-down and not bottom-up as regular joins. For example, A natural join B natural join C can be broken down to a left deep plan tree. Full disjunctions cannot be broken into such a thing (in this version) and FD('A,B,C') directly returns a set of results. Inside C-language FullDisjunctions() function i repeatedly call, using SPI: SELECT * FROM Relation1; SELECT * FROM Relation2; SELECT * FROM Relation1 WHERE...; SELECT * FROM Relation3; You would need to force all these operations to be done with the same snapshot; should be possible with SPI_execute_snapshot. But really the above sounds like a toy prototype implementation to me. Why aren't you building this as executor plan-tree machinery? I actually use cursors because i reiterate on the SELECT * FROM Relation1 queries using the FETCH_ALL technique. Hopefully cursors uses something similar to SPI_execute_snapshot? (maybe on READ_ONLY that i use. i see it uses something called ActiveSnapshot) (but for WHERE queries that are intended to exploit indices in the relations i must execute repeatedly). The reason, is two fold. - At this time i don't see any big advantage (aside from the schema) in putting it in the parser and subsequently the executor. - I want to work inside the frame of time for the soc. I think that i should first have a stable contrib module that looks acceptable before i continue to something more problematic to maintain. We have a new paper that was accepted to VLDB yesterday that breaks down the problem into smaller ones + iterators + have polynomial delay that is suited for streaming, hence the possibility for implementing in the planner but it's too complex for soc. Lets have a stable something first. p.s.: In a different version of the function i create a temporary relation and insert tuples in it, but it is exclusively used and destroyed by the specific instance of that function. Why? You could use a tuplestore for transient data. I do use tuplestore, but the other version needs an index and you can't put an index on a tuplestore. Unless, you can give me a hint on how to create a btree/hash index without a relation but that can be stored on disk like tuplestore. I.e. all data is stored in the index. The key is the whole tuple (the array of CTIDs) anyway. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] More thoughts about planner's cost estimates
Josh Berkus josh@agliodbs.com writes: 1. n-distinct estimation is bad, as previously discussed; 2. our current heuristics sampling methods prevent us from sampling more than 0.5% of any reasonably large table, causing all statistics on those tables to be off for any table with irregular distribution of values; I'm convinced these two are more connected than you believe. For distributions of values in a range using small samples is on solid statistical basis. It's the same reason poll results based on only a few hundred people can accurately predict elections in a country with hundreds of millions of voters. However for things like estimating discrete values there's absolutely no solid foundation for these small samples. Your accuracy is going to be simply proportional to the sample size, meaning you can't get anything trustworthy without sampling large enough portions of the table that a full sequential scan would be just as fast. I might be interested in implementing that algorithm that was posted a while back that involved generating good unbiased samples of discrete values. The algorithm was quite clever and well described and paper claimed impressively good results. However it will only make sense if people are willing to accept that analyze will need a full table scan -- at least for tables where the DBA knows that good n_distinct estimates are necessary. 3. We don't have any method to analyze inter-column correlation within a table; 4. We don't keep statistics on foriegn key correlation; Gosh these would be nice but they sound like hard problems. Has anybody even made any headway in brainstorming how to tackle them? 5. random_page_cost (as previously discussed) is actually a funciton of relatively immutable hardware statistics, and as such should not need to exist as a GUC once the cost model is fixed. I don't think that's true at all. Not all hardware is the same. Certainly the need to twiddle this GUC should be drastically reduced if the cache effects are modelled properly and the only excuses left are legitimate hardware differences. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
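For context, the only knob available today for enlarging the sample is the per-column statistics target; ANALYZE samples roughly 300 rows per unit of target, if memory serves, so even the current maximum of 1000 looks at only a few hundred thousand rows of a big table, which is the point being made above. Hypothetical table and column names:

    ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
    ANALYZE orders;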
[HACKERS] stable snapshot looks outdated
Looking at http://www.postgresql.org/ftp/stable_snapshot/ surely we have achieved stability at least once since 2005-11-26... :-) Can we get that fixed? -- Robert Treat Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] More thoughts about planner's cost estimates
Greg, I'm convinced these two are more connected than you believe. Actually, I think they are inseparable. I might be interested in implementing that algorithm that was posted a while back that involved generating good unbiased samples of discrete values. The algorithm was quite clever and well described and paper claimed impressively good results. However it will only make sense if people are willing to accept that analyze will need a full table scan -- at least for tables where the DBA knows that good n_distinct estimates are necessary. What about block-based sampling? Sampling 1 in 20 disk pages, rather than 1 in 20 rows, should require significantly less scanning, and yet give us a large enough sample for reasonable accuracy. 3. We don't have any method to analyze inter-column correlation within a table; 4. We don't keep statistics on foreign key correlation; Gosh these would be nice but they sound like hard problems. Has anybody even made any headway in brainstorming how to tackle them? There's no time like the present! Actually, these both seem like fairly straightforward problems storage-wise. The issue is deriving the statistics, for tables with many columns or FKs. 5. random_page_cost (as previously discussed) is actually a function of relatively immutable hardware statistics, and as such should not need to exist as a GUC once the cost model is fixed. I don't think that's true at all. Not all hardware is the same. Certainly the need to twiddle this GUC should be drastically reduced if the cache effects are modelled properly and the only excuses left are legitimate hardware differences. OK ... but still, it should become a little knob rather than the big knob it is currently. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] More thoughts about planner's cost estimates
Josh Berkus josh@agliodbs.com writes: Yeah. I've refrained from proposing changes because it's a pick-up-sticks. If we start modifying the model, we need to fix *everything*, not just one item. And then educate our users that they need to use the GUC variables in a different way. Here's the issues I see: [ various deficiencies in our statistics gathering ] Those are all valid points, but pretty much orthogonal to what I'm on about today. The problems I'm thinking about are seen even when the planner's rowcount estimates are dead on (as indeed they mostly were in Philippe's example). 5. random_page_cost (as previously discussed) is actually a funciton of relatively immutable hardware statistics, and as such should not need to exist as a GUC once the cost model is fixed. Well, it'll still exist as a GUC for the same reasons the other ones are GUCs. But I think the main reasons people have needed to twiddle it are (1) their database is completely RAM-resident (where RPC *should* logically be 1.0), or (2) they're trying to compensate for the overestimation of nestloop indexscans. The changes I'm proposing should help with (2). For btrees, we should be able to accurately estimate the cost of the index access given: a) the depth of the tree; Although logically the tree descent *ought* to be a factor, I haven't seen any evidence that we need to take it into account; in fact all the evidence points to the conclusion that we're better off ignoring it. My theory about why is that the upper levels of the tree stay in cache. I have to admit that's just handwaving, but counting additional disk hits to fetch the first index tuple is clearly not the direction the cost model needs to go in. Look again at the examples in Philippe's output: an indexscan fetching 1700 tuples (about 5 leaf pages probably) took 32x longer than one fetching 7 tuples (presumably all on the same leaf page). There's no way that a model that counts an additional couple of page fetches for tree descent is going to arrive at those numbers. And we see this over and over again: incredibly small times to fetch a single index tuple, at least on the average when the index is being hit many times in one query. c) whether the index and tables are already in shared_mem, or else (d) and (e) below: d) the probability that the index pages are cached in memory, which is composed of: (1) the frequency with which that index is accessed, modified by (2) whether the index is a fraction of available ram, or larger than ram e) the probability that the relevant table pages are cached in memory, determined by the same two factors. These would all be nice things to know, but I'm afraid it's pie in the sky. We have no reasonable way to get those numbers. (And if we could get them, there would be another set of problems, namely plan instability: the planner's choices would become very difficult to reproduce.) The big difficulty in modeling cache effects from multiple scans is that we usually don't know how many times the index will be scanned. I think we can work around this by figuring out whether the index has already been scanned in the plan, something we *can* know. So the first scan will be full cost and remaining scans will be fractional cost. Uh, no, because what we're normally thinking about is independent probes into the index with different keys. For a small number of probes you have to figure that those all hit different leaf pages and aren't going to amortize well. 
As the number of probes goes up, you can start charging fractional costs because it's more likely you're hitting a leaf page you hit recently. The Mackert/Lohman equations do this reasonably well --- it's possible we can find something better, but those seem like a good place to start. The immediate problem is to get an infrastructure in place that gives us some numbers to apply equations to. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
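For readers who want to chase the reference, the Mackert/Lohman approximation (from their 1989 TODS paper on index scans with a finite LRU buffer, and, if memory serves, the form already used in cost_index) estimates the pages fetched over Ns index probes as roughly:

\[
PF =
\begin{cases}
\min\!\left(\dfrac{2TNs}{2T+Ns},\; T\right) & T \le b \\[1ex]
\dfrac{2TNs}{2T+Ns} & T > b,\ Ns \le \dfrac{2Tb}{2T-b} \\[1ex]
b + \left(Ns - \dfrac{2Tb}{2T-b}\right)\dfrac{T-b}{T} & T > b,\ Ns > \dfrac{2Tb}{2T-b}
\end{cases}
\]

where T is the number of pages in the table, N the number of tuples, s the selectivity (so Ns is the number of tuples fetched), and b the number of buffer pages available.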
Re: [HACKERS] More thoughts about planner's cost estimates
Josh Berkus josh@agliodbs.com writes: However it will only make sense if people are willing to accept that analyze will need a full table scan -- at least for tables where the DBA knows that good n_distinct estimates are necessary. What about block-based sampling? Sampling 1 in 20 disk pages, rather than 1 in 20 rows, should require siginificantly less scanning, and yet give us a large enough sample for reasonable accuracy. a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks then your sample will be biased. b) It *still* wouldn't give you reasonable accuracy for n_distinct estimates. Your probability of being accurate for discrete processes like n_distinct is going to be more or less proportional to your sample. So sampling 5% of the table gives you only a 5% of the chance of an accurate prediction. More or less. Actually, these both seem like fairly straightforward problems storage-wise. The issue is deriving the statistics, for tables with many columns or FKs. Well there are problems on many levels: 1) You have n^2 possible two-column combinations. That's a lot of processing and storage. 2) It isn't even clear what data you're exactly looking for. Certainly correlation is just shorthand here and isn't what you're actually looking for. You need something like a complete histogram for the dependent column for every possible value of the forcing column. That would be an enormous amount of storage. OK ... but still, it should become a little knob rather than the big knob it is currently. Sure, the more accurately you model the cache the less need there would be to adjust random_page_cost. I'm not sure how accurate Postgres can really get unless it switches philosophies towards using O_DIRECT and doing it's own caching and scheduling. But it could certainly be much better than today. Tom's comment makes it look like there's hope for pretty accurate cache modelling for nested loop plans. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Generalized concept of modules
On Wed, May 31, 2006 at 05:33:44PM -0400, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: While you do have a good point about non-binary modules, our module handling need some help IMHO. For example, the current hack for CREATE LANGUAGE to fix things caused by old pg_dumps. I think that's the totally wrong approach long term, I think the pg_dump shouldn't be including the CREATE LANGUAGE statement at all, but should be saying something like INSTALL plpgsql and pg_restore works out what is needed for that module. There's a lot to be said for this, but I keep having the nagging feeling that people are equating module with shared library, which seems far from sufficiently general. I'd like to see module mean an arbitrary collection of SQL objects. I agree that module is often used interchangably with shared library. We need to handle the other case too. It would be a lot easier if we had an example of an SQL only module, since contrib doesn't appear to have one (at first glance anyway). So I think the raw definition sought by your INSTALL would always be a SQL script, and any shared libs that might come along with that are secondary. The idea of using pg_depend to manage UNINSTALL is an excellent one. Well, in that case I'd like to give some concrete suggestions: 1. The $libdir in future may be used to find SQL scripts as well as shared libraries. They'll have different extensions so no possibility of conflict. 2. Create something like BEGIN MODULE xxx which starts a transaction and marks any objects created within it as owned by module xxx. I think it should be tied to a transaction level to avoid half installed things, but maybe people would prefer it to work more like schemas. pg_module system catalog. You'd need this anyway since there has to be some representation of the module object in the catalogs for its component objects to have pg_depend dependencies on. Ack. Owned by in the above sense means that the object depends on the module. You could do it the other way round (module depends on object) but that makes it harder to change things manually. DROP MODULE would work easier too. Let's see, I guess pg_dump would have to be taught to ignore any objects that it can see are directly dependent on a module object. What about indirect dependencies though? The exact semantics don't seem clear to me. At a base level, you could definitly drop the functions. Dropping types is harder because columns might be using them. Normally we use CASCADE to specify that. Also, this seems to be getting into territory that Oracle has already trod --- someone should study exactly what they do for PL/SQL modules and whether we want to be compatible or not. Perhaps there's even something in SQL2003 about it? Probably a good idea... Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] More thoughts about planner's cost estimates
Greg, Tom, a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks then your sample will be biased. There are actually some really good equations to work with this, estimating both row correlation and n-distinct based on sampling complete blocks. See prior discussions on this list on N-distinct. 1) You have n^2 possible two-column combinations. That's a lot of processing and storage. Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2. 2) It isn't even clear what data you're exactly looking for. Certainly correlation is just shorthand here and isn't what you're actually looking for. Actually, I'd think that a correlation number estimate (0 = complete uncorrelated, 1 = completely correlated) would be sufficient to improve row count estimation significantly, without incurring the vast overhead of histogramXhistorgram manipulation. Those are all valid points, but pretty much orthogonal to what I'm on about today. The problems I'm thinking about are seen even when the planner's rowcount estimates are dead on (as indeed they mostly were in Philippe's example). OK, I was afraid they were interdependant. You would know better than me. Well, it'll still exist as a GUC for the same reasons the other ones are GUCs. But I think the main reasons people have needed to twiddle it are (1) their database is completely RAM-resident (where RPC *should* logically be 1.0), or (2) they're trying to compensate for the overestimation of nestloop indexscans. The changes I'm proposing should help with (2). Right. What I'm saying is that (1) should be derived from estimated_cache_size and DBSIZE, not by setting an additional GUC. (1) the frequency with which that index is accessed, modified by (2) whether the index is a fraction of available ram, or larger than ram e) the probability that the relevant table pages are cached in memory, determined by the same two factors. These would all be nice things to know, but I'm afraid it's pie in the sky. We have no reasonable way to get those numbers. (And if we could get them, there would be another set of problems, namely plan instability: the planner's choices would become very difficult to reproduce.) Hmmm ... I think those numbers are possible to infer, at least on a gross level. Whether the cost of calculating them outweighs the benefit of having them is another question, resolvable only through some experimentation. a good place to start. The immediate problem is to get an infrastructure in place that gives us some numbers to apply equations to. No arguments, of course. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Generalized concept of modules
Martijn van Oosterhout kleptog@svana.org writes: Well, in that case I'd like to give some concrete suggestions: 1. The $libdir in future may be used to find SQL scripts as well as shared libraries. They'll have different extensions so no possibility of conflict. No, it needs to be a separate directory, and the reason is that SQL scripts are architecture-independent and belong under the share/ directory not the lib/ directory. This is a minor point, but the packagers will scream at us if we get it wrong. 2. Create something like BEGIN MODULE xxx which starts a transaction and marks any objects created within it as owned by module xxx. I think it should be tied to a transaction level to avoid half installed things, but maybe people would prefer it to work more like schemas. I think I'd sooner keep it decoupled from transactions, which means you need both a BEGIN MODULE xxx and an END MODULE xxx. Also, a module developer might want to go back and add more stuff to an existing module. So there should be separate commands: CREATE MODULE xxx; BEGIN MODULE xxx; ... anything created here belongs to the module END MODULE xxx; An alternative possibility is to make it work like schemas: you set a GUC variable to indicate that created objects should be associated with that module. This might be the best choice since it avoids having two very different ways of doing very similar things. (Come to think of it, what is the relation between modules and schemas anyway? Should a module's objects be required to be in schemas also owned by the module? If not, what happens when you try to INSTALL a module whose objects lived in a schema you don't have? This gets back to the fact that we don't have a very nice answer for installing existing contrib modules into user-chosen schemas.) Let's see, I guess pg_dump would have to be taught to ignore any objects that it can see are directly dependent on a module object. What about indirect dependencies though? The exact semantics don't seem clear to me. At a base level, you could definitly drop the functions. Dropping types is harder because columns might be using them. Normally we use CASCADE to specify that. I think we're talking at cross purposes. It sounds like you're thinking about how do I remove one object in a module? AFAICS you just drop it. What I was wondering was what is pg_dump's behavior. ISTM we want two modes: * normal behavior: dump module objects as INSTALL foo commands. Do not dump any objects that are owned by a module; assume they will be created by the INSTALL. Use the dependency chains to make sure that INSTALL commands are ordered properly relative to everything else. * dump module foo: dump the module object as a CREATE MODULE command, and then dump creation commands for all the objects that are owned by it. Ignore all else. This is an easy way to generate an updated module definition script. Something that's not at all clear to me is object ownership. Should objects belonging to a module be marked as being owned by the person executing INSTALL, or should module dump/install try to preserve original ownership? I think in most scenarios trying to preserve the original ownership is wrong, since commonly INSTALL will be used to transfer objects to new databases where the original owner might not exist at all. But it's a bit non-orthogonal. Also, what privileges are needed to execute either CREATE MODULE or INSTALL? 
Conservative design would say superuser-only, but I can't put my finger on any specific reason for that, at least if none of the contained objects need superuser privs to create. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
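Pulling the commands proposed in this thread together, the intended workflow (none of this is implemented syntax; it is only the proposal under discussion) would look roughly like:

    CREATE MODULE foo;
    BEGIN MODULE foo;
    CREATE FUNCTION foo_add(int, int) RETURNS int
        AS 'select $1 + $2' LANGUAGE sql;   -- objects created here belong to module foo
    CREATE TYPE foo_pair AS (a int, b text);
    END MODULE foo;

    -- later, in a new or restored database, pg_dump output would simply say:
    INSTALL foo;
    -- and removal would be:
    DROP MODULE foo;   -- presumably with CASCADE-like semantics, per the discussion

Whether module membership is tracked by explicit BEGIN/END MODULE commands or by a GUC variable (as with search_path for schemas) is exactly the open question above.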
Re: [HACKERS] More thoughts about planner's cost estimates
Greg, 1) You have n^2 possible two-column combinations. That's a lot of processing and storage. Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2. Ooops, bad math. Andrew pointed out it's actually n*(n-1)/2, not n!. Also, we could omit columns unlikely to correlate, such as large text columns, bytea and numerics with high precisions. Also, we probably don't need to correlate UNIQUE columns inside ... I think. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] CVS-Unknown buildfarm failures?
meerkat and snake both have persistent CVS-Unknown failures in some but not all branches. I can't see any evidence of an actual failure in their logs though. What I do see is ? entries about files that shouldn't be there --- for instance, meerkat apparently needs a make distclean. If that's what's causing the failure report, could we get the buildfarm to show a more useful status message? I'd always assumed that CVS-Unknown suggested a transient problem such as connection loss, and there wasn't any need for human intervention. A more radical answer is to have the script go ahead and delete the offending files itself, but I can see where that might not have good fail-soft behavior ... regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] CVS-Unknown buildfarm failures?
Tom Lane wrote: meerkat and snake both have persistent CVS-Unknown failures in some but not all branches. I can't see any evidence of an actual failure in their logs though. What I do see is ? entries about files that shouldn't be there --- for instance, meerkat apparently needs a make distclean. If that's what's causing the failure report, could we get the buildfarm to show a more useful status message? I'd always assumed that CVS-Unknown suggested a transient problem such as connection loss, and there wasn't any need for human intervention. A more radical answer is to have the script go ahead and delete the offending files itself, but I can see where that might not have good fail-soft behavior ... I have manually run a distclean on meerkat for 8_0 and 8_1 and am rerunning the builds now. Joshua D. Drake regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] More thoughts about planner's cost estimates
On Thu, Jun 01, 2006 at 02:25:56PM -0400, Greg Stark wrote: Josh Berkus josh@agliodbs.com writes: 1. n-distinct estimation is bad, as previously discussed; 2. our current heuristics sampling methods prevent us from sampling more than 0.5% of any reasonably large table, causing all statistics on those tables to be off for any table with irregular distribution of values; I'm convinced these two are more connected than you believe. For distributions of values in a range using small samples is on solid statistical basis. It's the same reason poll results based on only a few hundred people can accurately predict elections in a country with hundreds of millions of voters. However for things like estimating discrete values there's absolutely no solid foundation for these small samples. Your accuracy is going to be simply proportional to the sample size, meaning you can't get anything trustworthy without sampling large enough portions of the table that a full sequential scan would be just as fast. I might be interested in implementing that algorithm that was posted a while back that involved generating good unbiased samples of discrete values. The algorithm was quite clever and well described and paper claimed impressively good results. However it will only make sense if people are willing to accept that analyze will need a full table scan -- at least for tables where the DBA knows that good n_distinct estimates are necessary. There *might* be an alternative Suppose that the executor could keep track of what values it's seeing from a table as it's actually running a query. These could be used to build statistical information without actually running analyze. Of course, the problem is that you could end up with some seriously skewed statistics, depending on what your query patterns are. But in some ways that's not bad; you're optimizing for the queries you most often run. One possible way to handle this would be to keep track of how many times each value has been seen, as well as when it was last seen, so you have some idea of the quality of the data. Another idea is to somehow blend these stats with the traditional method. 3. We don't have any method to analyze inter-column correlation within a table; 4. We don't keep statistics on foriegn key correlation; Gosh these would be nice but they sound like hard problems. Has anybody even made any headway in brainstorming how to tackle them? A *huge* improvement would be gathering statistics on all multi-column indexes. Some of the stats might not make sense in this context, but others (such as correlation) would. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] More thoughts about planner's cost estimates
On Thu, Jun 01, 2006 at 03:15:09PM -0400, Tom Lane wrote: These would all be nice things to know, but I'm afraid it's pie in the sky. We have no reasonable way to get those numbers. (And if we could get them, there would be another set of problems, namely plan instability: the planner's choices would become very difficult to reproduce.) Speaking of plan instability, something that's badly needed is the ability to steer away from query plans that *might* be the most optimal, but also will fail horribly should the cost estimates be wrong. People generally don't care about getting the absolutely most optimal plan; they do care about NOT getting a plan that's horribly bad. One possible way to do this would be to have the estimator calculate a worst-case cost to go along with the best case cost, or maybe even 3 numbers: ideal, what we think will happen, and worst-case. I know that index scans already have that information, at least in the form of cost for a plan with 0 correlation and one with perfect correlation. Does anyone have any experience developing genetic algorithms? I think coming up with cost estimators might be an ideal use for that technology... -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] More thoughts about planner's cost estimates
Josh Berkus josh@agliodbs.com writes: Greg, Tom, a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks then your sample will be biased. There are actually some really good equations to work with this, estimating both row correlation and n-distinct based on sampling complete blocks. See prior discussions on this list on N-distinct. In the prior discussions someone posted the paper with the algorithm I mentioned. That paper mentions that previous work showed poor results at estimating n_distinct even with sample sizes as large as 50% or more. 1) You have n^2 possible two-column combinations. That's a lot of processing and storage. Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2. You have O(n^2) possible *two-column* combinations and O(n!) n-column combinations. I was hoping two-column combinations would be enough information to extrapolate from for larger combinations. 2) It isn't even clear what data you're exactly looking for. Certainly correlation is just shorthand here and isn't what you're actually looking for. Actually, I'd think that a correlation number estimate (0 = complete uncorrelated, 1 = completely correlated) would be sufficient to improve row count estimation significantly, without incurring the vast overhead of histogramXhistorgram manipulation. No, correlation is actually quite uninteresting here. Consider for example two columns state and area code. The area code is pretty much 100% dependent on state, but will show very little correlation. Similarly things like product name and catalog number or even author and genre would be problem cases but have little correlation. And you can easily imagine queries that specify restrictive clauses on two such columns for which the existing statistics will overestimate the selectivity because it assumes no interdependency. Hopefully you're right that you don't need complete histograms. Perhaps there's some statistics concept they don't teach in stats 101 that would cover this need. What we're looking for is a function f(a,b,x) that gives the net selectivity given a and b, the selectivity of two clauses based on two columns, and x some simple value that can be easily calculated by looking at the data in advance. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
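A concrete instance of the problem, with hypothetical table and column names: under the independence assumption the planner estimates the combined selectivity as sel(state = 'CA') * sel(area_code = '415'), even though nearly every '415' row is already a 'CA' row, so the true selectivity is close to sel(area_code = '415') alone and the row count is badly underestimated.

    SELECT count(*) FROM customers
     WHERE state = 'CA'
       AND area_code = '415';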
Re: [HACKERS] More thoughts about planner's cost estimates
On Thu, Jun 01, 2006 at 08:36:16PM -0400, Greg Stark wrote: Josh Berkus josh@agliodbs.com writes: Greg, Tom, a) We already use block based sampling to reduce overhead. If you're talking about using the entire block and not just randomly sampled tuples from within those blocks then your sample will be biased. There are actually some really good equations to work with this, estimating both row correlation and n-distinct based on sampling complete blocks. See prior discussions on this list on N-distinct. In the prior discussions someone posted the paper with the algorithm I mentioned. That paper mentions that previous work showed poor results at estimating n_distinct even with sample sizes as large as 50% or more. Which paper? People have referenced several different ones. 1) You have n^2 possible two-column combinations. That's a lot of processing and storage. Yes, that's the hard problem to solve. Actually, btw, it's n!, not n^2. You have O(n^2) possible *two-column* combinations and O(n!) n-column combinations. I was hoping two-column combinations would be enough information to extrapolate from for larger combinations. The math nerd in me says that this is bad math, but it might work well enough by ass-u-ming a lack of higher-order correllations. 2) It isn't even clear what data you're exactly looking for. Certainly correlation is just shorthand here and isn't what you're actually looking for. Actually, I'd think that a correlation number estimate (0 = complete uncorrelated, 1 = completely correlated) would be sufficient to improve row count estimation significantly, without incurring the vast overhead of histogramXhistorgram manipulation. No, correlation is actually quite uninteresting here. Consider for example two columns state and area code. The area code is pretty much 100% dependent on state, but will show very little correlation. There are quantitative tests of independence available. I'm not sure whether any of these have been applied even theoretically in DBMS-land. Hopefully you're right that you don't need complete histograms. Perhaps there's some statistics concept they don't teach in stats 101 that would cover this need. What we're looking for is a function f(a,b,x) that gives the net selectivity given a and b, the selectivity of two clauses based on two columns, and x some simple value that can be easily calculated by looking at the data in advance. That would be neat :) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 415 235 3778AIM: dfetter666 Skype: davidfetter Remember to vote! ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] More thoughts about planner's cost estimates
Tom Lane wrote: Another thing that's bothering me is that the index access cost computation (in genericcostestimate) is looking sillier and sillier:

    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index is
     * a metadata page, and we should figure on fetching that plus a pro-rated
     * fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;    /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;

There are several things wrong with this, at least in its application to btrees. It's not counting descent of the btree (ie, touches of the root page and intermediate-level pages). On the other hand it's surely overcharging for metapage touches. As of CVS tip we cache the metapage in the relcache, so it's probably fair to disregard that cost altogether. And on the third hand, when we need to retrieve multiple leaf pages it's over-optimistic to assume that those'll be purely sequential fetches. (That would be approximately true in a freshly-built btree, but not in one that's suffered any amount of page splitting or recycling.) I am inclined to think that a reasonable model is to continue to estimate the number of index leaf pages touched as above (pro-rating against the total number of index entries), to charge random_page_cost per leaf page touched, and not to count anything for metapage or tree descent. I justify the latter on the grounds that upper tree levels will tend to stay in cache because they're touched so often. random_page_cost per leaf page may be an overestimate, since sometimes logically adjacent leaf pages will be physically adjacent too, but not charging for tree descent should help to cancel that out. With this model, the disk cost to fetch a single index entry will be estimated as random_page_cost (default 4.0) rather than the current fixed 2.0. This shouldn't hurt things too much for simple indexscans --- especially since anyone running with a reduced random_page_cost won't see as much change. And it will increase the costs for bitmap scans that cross many index pages, which is what we need in light of Philippe's numbers. This sounds good to me, although the 2.0 to 4.0 cost jump may cause (more) cases of people seeing their index scans in pre-8.2 versions becoming some other type of access in 8.2. I guess a comment about testing existing applications could be placed in the 8.2 release notes? Now we have seen a lot of cases in which indexscans were being drastically overestimated, so increasing the cost estimate even more may seem like a horrid idea. But I believe that most or all of these cases were ones where the same index was being visited many times, and so the real estimation failure is not accounting for cache effects across multiple indexscans. Rather than being afraid to raise the cost estimate for an indexscan in isolation, I think it's time to bite the bullet and do something about that. The big difficulty in modeling cache effects from multiple scans is that we usually don't know how many times the index will be scanned. 
If we did have that number, I think it'd be reasonable to use the Mackert and Lohman approximation already used in cost_index to estimate the total number of index leaf pages fetched over N scans, and then divide by N to get our per-scan estimated cost. But because the planner builds cost estimates bottom-up, in most scenarios we don't have any idea whether an indexscan plan node will be iterated once or many times in the finished plan. However there is one case where we do have some information: when computing a best_inner_indexscan() for a nestloop join, it'd be reasonable to use the estimated size of the outer relation as the loop count. And this is exactly the case where we are falling down in practice: the complaints we've seen are mostly about overestimating the cost of a nestloop-with-inner-index-scan. So I think handling just this case would be a big step forward, even if it's not a theoretically pure general solution. If I understand correctly, this is the case where we normally see folks needing to add several 'set enable_xxx=off' statements to get the nested loop plan that *actually* works best :-). So, also looks good! Cheers Mark ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
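For the record, the workaround alluded to usually looks something like this (hypothetical query; the enable_* settings are real planner GUCs):

    SET enable_hashjoin = off;      -- discourage the overestimated alternatives
    SET enable_mergejoin = off;
    EXPLAIN ANALYZE
    SELECT o.* FROM orders o JOIN customers c ON o.customer_id = c.id
     WHERE c.name = 'ACME';         -- hoping for a nestloop with inner indexscan on orders
    RESET enable_hashjoin;
    RESET enable_mergejoin;

Fixing the cost model as described above would make this sort of hand-holding unnecessary in the common cases.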
Re: [HACKERS] More thoughts about planner's cost estimates
Mark Kirkwood [EMAIL PROTECTED] writes: Tom Lane wrote: With this model, the disk cost to fetch a single index entry will be estimated as random_page_cost (default 4.0) rather than the current fixed 2.0. This shouldn't hurt things too much for simple indexscans --- especially since anyone running with a reduced random_page_cost won't see as much change. And it will increase the costs for bitmap scans that cross many index pages, which is what we need in light of Philippe's numbers. This sounds good to me, although the 2.0 - 4.0 cost jump may cause (more) cases of people seeing their index scans in pre-8.2 versions becoming some other type of access in 8.2. I guess a comment about testing existing applications could be placed in the 8.2 release notes? Yeah, that comes with the territory. One point to note is that with this model, setting random_page_cost below 2.0 will actually make small indexscans look *cheaper* than they do now. So it'll certainly be possible to make the thing jump in that direction if you need to. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] 'CVS-Unknown' buildfarm failures?
Tom Lane said: meerkat and snake both have persistent CVS-Unknown failures in some but not all branches. I can't see any evidence of an actual failure in their logs though. What I do see is ? entries about files that shouldn't be there --- for instance, meerkat apparently needs a make distclean. If that's what's causing the failure report, could we get the buildfarm to show a more useful status message? I'd always assumed that CVS-Unknown suggested a transient problem such as connection loss, and there wasn't any need for human intervention. A more radical answer is to have the script go ahead and delete the offending files itself, but I can see where that might not have good fail-soft behavior ... cvs-unknown means there are unknown files in the repo: my $unknown_files = grep {/^\?/ } @cvslog; ... send_result('CVS-Unknown',$unknown_files,[EMAIL PROTECTED]) if ($unknown_files); This is almost always a case of operator error. buildfarm only ever builds in a copy of the repo, not in the permanent repo itself, so there should NEVER be any file there which does not come from CVS. I have repeatedly advised buildfarm member owners not to build by hand in the buildfarm repos. Not everybody listens, apparently. All this is intended to ensure that we are actually working on a faithful reflection of the postgresql.org repo, and not something that has been mangled somehow. I can call it CVS-Unknown-Files if that will make it clearer. cheers andrew ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] 'CVS-Unknown' buildfarm failures?
Andrew Dunstan [EMAIL PROTECTED] writes: Tom Lane said: meerkat and snake both have persistent CVS-Unknown failures in some but not all branches. I can't see any evidence of an actual failure in their logs though. cvs-unknown means there are unknown files in the repo: Oh. Well, it needs renamed then ;-). Per our message style guidelines, calling an error unknown is seldom a good idea. I can call it CVS-Unknown-Files if that will make it clearer. Maybe CVS-Extraneous-Files? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Going for all green buildfarm results
I've been making another pass over getting rid of buildfarm failures. The remaining ones I see at the moment are: firefly HEAD: intermittent failures in the stats test. We seem to have fixed every other platform back in January, but not this one. kudu HEAD: one-time failure 6/1/06 in statement_timeout test, never seen before. Is it possible system was under enough load that the 1-second timeout fired before control reached the exception block? tapir HEAD: pilot error, insufficient SysV shmem settings carp various: carp seems to have *serious* hardware problems, as it has been failing randomly in all branches for a long time. I suggest putting that poor machine out to pasture. penguin 8.0: fails in tsearch2. Previous investigation says that the failure is unfixable without initdb, which we are not going to force for 8.0 branch. I suggest retiring penguin from checking 8.0, as there's not much point in continuing to see a failure there. Or is it worth improving buildfarm to be able to skip specific tests? penguin 7.4: fails in initdb, with what seems to be a variant of the alignment issue that kills tsearch2 in 8.0. We won't fix this either, so again might as well stop tracking this branch on this machine. cobra, stoat, sponge 7.4: pilot error. Either install Tk or configure --without-tk. firefly 7.4: dblink test fails, with what looks like an rpath problem. Another one that we fixed awhile ago, and the fix worked on every platform but this one. firefly 7.3: trivial regression diffs; we could install variant comparison files if anyone cared. cobra, stoat, caribou 7.3: same Tk configuration error as in 7.4 branch Firefly is obviously the outlier here. I dunno if anyone cares enough about SCO to spend time investigating it (I don't). Most of the others just need a little bit of attention from the machine owner. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster