[HACKERS] strange cost for correlated subquery
Hello

I tested a SELF JOIN against a correlated subquery for computing subtotals. It's strange: the correlated subquery is faster, but it has a much higher estimated cost:

postgres=# explain analyze select t1.id, t1.sale_date, t1.product, t1.sale_price, sum(t2.sale_price)
             from history t1 inner join history t2 on t1.id = t2.id and t1.product = t2.product
             group by t1.id, t1.sale_date, t1.product, t1.sale_price
             order by t1.id;
                                 QUERY PLAN
 Sort  (cost=3678.85..3691.36 rows=5003 width=19) (actual time=1553.575..1560.618 rows=5003 loops=1)
   Sort Key: t1.id
   Sort Method: quicksort  Memory: 480kB
   ->  HashAggregate  (cost=3308.91..3371.45 rows=5003 width=19) (actual time=1530.276..1540.206 rows=5003 loops=1)
         ->  Nested Loop  (cost=0.00..1708.29 rows=128050 width=19) (actual time=0.264..1034.048 rows=198333 loops=1)
               ->  Seq Scan on history t1  (cost=0.00..78.03 rows=5003 width=15) (actual time=0.077..8.835 rows=5003 loops=1)
               ->  Index Scan using fxxx on history t2  (cost=0.00..0.31 rows=1 width=11) (actual time=0.017..0.087 rows=40 loops=5003)
                     Index Cond: (((t2.product)::text = (t1.product)::text) AND (t1.id = t2.id))
 Total runtime: 1567.125 ms
(9 rows)

postgres=# explain analyze SELECT sale_date, product, sale_price,
             COALESCE((SELECT SUM(sale_price) FROM history WHERE product = o.product AND id = o.id), 0) AS total
           FROM history o;
                                 QUERY PLAN
 Seq Scan on history o  (cost=0.00..41532.29 rows=5003 width=15) (actual time=0.073..825.333 rows=5003 loops=1)
   SubPlan
     ->  Aggregate  (cost=8.28..8.29 rows=1 width=4) (actual time=0.158..0.159 rows=1 loops=5003)
           ->  Index Scan using fxxx on history  (cost=0.00..8.27 rows=1 width=4) (actual time=0.018..0.086 rows=40 loops=5003)
                 Index Cond: (((product)::text = ($0)::text) AND (id = $1))
 Total runtime: 833.213 ms
(6 rows)

postgres=# show effective_cache_size;
 effective_cache_size
----------------------
 600MB
(1 row)

postgres=# SHOW shared_buffers;
 shared_buffers
----------------
 300MB
(1 row)

Maybe my random_page_cost is too big?
postgres=# SHOW random_page_cost;
 random_page_cost
------------------
 4
(1 row)

Time: 0,351 ms
postgres=# set random_page_cost to 2;
SET
Time: 0,330 ms
postgres=# SHOW random_page_cost;
 random_page_cost
------------------
 2
(1 row)

Time: 0,320 ms
postgres=# explain analyze SELECT sale_date, product, sale_price,
             COALESCE((SELECT SUM(sale_price) FROM history WHERE product = o.product AND id = o.id), 0) AS total
           FROM history o;
                                 QUERY PLAN
 Seq Scan on history o  (cost=0.00..21518.09 rows=5003 width=15) (actual time=0.132..809.701 rows=5003 loops=1)
   SubPlan
     ->  Aggregate  (cost=4.28..4.29 rows=1 width=4) (actual time=0.154..0.155 rows=1 loops=5003)
           ->  Index Scan using fxxx on history  (cost=0.00..4.27 rows=1 width=4) (actual time=0.020..0.088 rows=40 loops=5003)
                 Index Cond: (((product)::text = ($0)::text) AND (id = $1))
 Total runtime: 817.358 ms

Regards

Pavel Stehule

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
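As an aside for readers comparing the two runs: only random_page_cost changed (4 vs 2), and the per-call indexscan estimate moved from 8.27 to 4.27. That is consistent with the estimate being dominated by roughly two random page fetches plus a small CPU component. This is a guess reverse-engineered from the quoted numbers, not a derivation from the planner source; the function and its parameters below are invented for illustration.

```python
# Back-of-the-envelope model of the quoted per-call indexscan costs
# (illustrative only; not PostgreSQL's actual cost formula).
def indexscan_cost(random_page_cost, pages=2, cpu=0.27):
    # ~2 random page fetches (index page + heap page) plus CPU overhead
    return random_page_cost * pages + cpu

# Matches the two EXPLAIN runs above, approximately:
assert abs(indexscan_cost(4) - 8.27) < 1e-9   # first run, random_page_cost = 4
assert abs(indexscan_cost(2) - 4.27) < 1e-9   # second run, random_page_cost = 2
```

Either way, the actual runtime barely moved (833 ms vs 817 ms), so the change mostly affects the estimate, not the execution.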
Re: [HACKERS] Commit fest?
Bruce Momjian [EMAIL PROTECTED] writes:

>> I'm curious where the comments are being stored, since it's not in the
>> html source.  The javascript must be pulling them from another url?
>
> The comments are stored at JS-Kit: http://js-kit.com/comments/

It's stored in their server?

> Well, I can't have @ in the URL because it is usually forbidden by
> browsers for phishing protection, so I was converting @ to '.' anyway.
> What I have now done is display the real message id and MD5 permanent
> link at the top of each message, e.g.:
>
> http://momjian.us/mhonarc/patches/msg00054.html

Any chance we could put that on the actual listing page somehow, perhaps
in a tiny font?  I want to be able to copy the thread and get enough
information for future reference.

Also, any chance you could use the permanent urls in the thread listing?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!
Re: [HACKERS] Commit fest?
Gregory Stark wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>>> I'm curious where the comments are being stored, since it's not in
>>> the html source.  The javascript must be pulling them from another
>>> url?
>>
>> The comments are stored at JS-Kit: http://js-kit.com/comments/
>
> It's stored in their server?

Yes, that was the beauty of it --- I just add javascript with a tag and
all comments are handled by them.

>> Well, I can't have @ in the URL because it is usually forbidden by
>> browsers for phishing protection, so I was converting @ to '.' anyway.
>> What I have now done is display the real message id and MD5 permanent
>> link at the top of each message, e.g.:
>>
>> http://momjian.us/mhonarc/patches/msg00054.html
>
> Any chance we could put that on the actual listing page somehow,
> perhaps in a tiny font?  I want to be able to copy the thread and get
> enough information for future reference.

You want the message-id on the listing page?  Sure, I was doing that
before but I didn't know anyone wanted it and it looked a little
cluttered.  Let me know.

> Also, any chance you could use the permanent urls in the thread
> listing?

Uh, that would be a little tricky because the next/previous wants to go
by message number increment, I think.  I could post-process all the HTML
files to do a search/replace.  I think the bigger problem is the threads
move around on the thread pages, and that seems impossible to fix as
items are added and removed.

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Commit fest?
Bruce Momjian [EMAIL PROTECTED] writes:
> You want the message-id on the listing page?  Sure, I was doing that
> before but I didn't know anyone wanted it and it looked a little
> cluttered.  Let me know.

I agree it was too cluttered for normal use.  What I'm trying to do is
get a page which has the message-id's of all the messages and the
comments on the same page.  That way I can dump the data into a text
file to experiment with.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!
Re: [HACKERS] Commit fest?
Bruce Momjian [EMAIL PROTECTED] writes:
> Gregory Stark wrote:
>> Bruce Momjian [EMAIL PROTECTED] writes:
>>> The comments are stored at JS-Kit: http://js-kit.com/comments/
>> It's stored in their server?
> Yes, that was the beauty of it --- I just add javascript with a tag and
> all comments are handled by them.

Beauty?  I don't think we want to rely on non-project-controlled servers
for anything that's part of our core infrastructure.  If/when js-kit.com
goes belly-up, what happens to that data?  Also, what kind of privacy
guarantees have we got?  (Admittedly, privacy may be moot for
information that was originally entered on a public web page, but the
whole idea of someone else controlling our data just makes me itch.)

I can go along with this as a jury-rig setup for our first commit fest,
but it just seems like another powerful argument for moving to something
wiki-based as soon as we can get that sorted.

			regards, tom lane
Re: [HACKERS] strange cost for correlated subquery
Pavel Stehule [EMAIL PROTECTED] writes:
> It's strange: the correlated subquery is faster, but it has a much
> higher cost:

In the nestloop plan, the estimated cost for the indexscan is discounted
based on the knowledge that it'll be executed repeatedly:

   ->  Index Scan using fxxx on history t2  (cost=0.00..0.31 rows=1 width=11) (actual time=0.017..0.087 rows=40 loops=5003)

In the subplan case that doesn't happen:

   ->  Index Scan using fxxx on history  (cost=0.00..8.27 rows=1 width=4) (actual time=0.018..0.086 rows=40 loops=5003)
         Index Cond: (((product)::text = ($0)::text) AND (id = $1))

Had the same discount been applied then the estimated costs would be
pretty nearly in line with reality, if I did the math right.

It'd be nice to do better but I'm not sure how; at the time that we
create plans for sub-queries we don't really have any way to know how
often they'll be called by the upper query.

			regards, tom lane
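To put rough numbers on the discount being described: charging the undiscounted per-call estimate once per outer row lands very close to the subplan query's reported total. This is plain arithmetic over the figures quoted in the plans, not planner code:

```python
# Arithmetic sanity check of the quoted plans (not PostgreSQL code).
loops = 5003             # outer rows, from "loops=5003" in EXPLAIN
subplan_per_call = 8.29  # Aggregate cost per call (no repetition discount)

# Undiscounted per-call cost times outer rows is very close to the
# subplan query's reported total cost of 41532.29:
subplan_total = loops * subplan_per_call
assert abs(subplan_total - 41532.29) < 100

# The nestloop's inner indexscan was discounted to 0.31 per iteration,
# which is why that plan's total (1708.29) comes out so much smaller
# despite doing essentially the same index probes:
nestloop_inner_total = loops * 0.31   # about 1551, under the 1708.29 total
assert nestloop_inner_total < 1708.29
```

So the cost gap between the two plans is almost entirely the presence or absence of the repeated-execution discount, not any real difference in work done.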
Re: [HACKERS] Single table forcing sequential scans on query plans
Cristian Gafton [EMAIL PROTECTED] writes:
> I have a weird query execution plan problem I am trying to debug on
> Postgresql 8.2.6.  I have a query that joins against a temporary table
> that has very few rows.

Is it possible that the temp table ever has exactly zero rows?

> My questions are:
> - what would make the analyze operation fail in the eyes of the
>   planner?
> - why does joining to a single unanalyzed table disable any and all
>   indexes from the other tables referenced in the query?

That's entirely the wrong way to think about it.  The planner is
choosing a good plan based on its estimates of table sizes, which are
wildly different in the two cases:

   ->  Seq Scan on tmpinstanceid  (cost=0.00..1.02 rows=2 width=8) (actual time=0.005..0.007 rows=2 loops=1)

   ->  Seq Scan on tmpinstanceid  (cost=0.00..29.40 rows=1940 width=8)

If there actually were nearly 2000 rows in the temp table, that
nested-loops plan would take about a thousand times longer than it does,
and you'd not be nearly so pleased with it.  The merge-and-hash-joins
plan looks quite sane to me for that table size.

The larger estimate is coming from some heuristics that are applied when
the table size recorded in pg_class.relpages/reltuples is exactly zero.
It's intentionally not small, to keep us from choosing a plan with
brittle performance behavior when we are looking at a table that's never
been vacuumed or analyzed.

The only idea I have for how the planner could ignore a previous analyze
result is if the analyze found the table to be of zero size.  Then the
heuristic would still be applied because relpages == 0.

			regards, tom lane
[HACKERS] Hash index build patch has *worse* performance at small table sizes
I've been reviewing the hash index build patch submitted here:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00154.php

Although it definitely helps on large indexes, it's actually
counterproductive on not-so-large ones.  The test case I'm using is
random integers generated like this:

	create table foo as select (random() * N)::int as f1
	  from generate_series(1,N);
	select count(*) from foo;	-- force hint bit updates
	checkpoint;

then timing

	create index fooi on foo using hash(f1);

Using all-default configuration settings on some not-very-new hardware,
at N = 1E6 I see

	8.3.1:                                   30 sec
	With pre-expansion of index (CVS HEAD):  24 sec
	With sorting:                            72 sec
	To build a btree index on same data:     34 sec

Now this isn't amazingly surprising, because the original argument for
doing sorting was to improve locality of access to the index during the
build, and that only matters if you've got an index significantly bigger
than memory.  If the index fits in RAM then the sort is pure overhead.

The obvious response to this is to use the sorting approach only when
the estimated index size exceeds some threshold.  One possible choice of
threshold would be shared_buffers (or temp_buffers for a temp index) but
I think that is probably too conservative, because in most scenarios the
kernel's disk cache is available too.  Plus you can't tweak that setting
without a postmaster restart.  I'm tempted to use effective_cache_size,
which attempts to measure an appropriate number and can be set locally
within the session doing the CREATE INDEX if necessary.  Or we could
invent a new GUC parameter, but that is probably overkill.

Comments?

			regards, tom lane
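The proposed threshold test is simple enough to sketch. Everything below is invented for illustration (function names, the page counts); the one assumption taken from the message is that the sort-based build should only kick in when the estimated index size exceeds effective_cache_size:

```python
# Sketch of the proposed heuristic, not code from the patch.
# Sizes are in pages; effective_cache_size is page-denominated internally.

def use_sorted_build(est_index_pages: int, effective_cache_size_pages: int) -> bool:
    """Sort the index tuples only if the index won't fit in cache,
    since sorting is pure overhead when it does."""
    return est_index_pages > effective_cache_size_pages

# A ~1M-row integer index is far smaller than a 600MB (76800-page) cache,
# so the in-memory build path is chosen and the 72-sec sorting penalty
# from the timings above is avoided:
assert use_sorted_build(35000, 76800) is False
assert use_sorted_build(1_000_000, 76800) is True   # huge index: sort pays off
```

The interesting design point is the one raised in the message: shared_buffers would be a defensible threshold too, but it ignores the kernel cache and can't be changed without a restart, whereas effective_cache_size can be tuned per-session around a CREATE INDEX.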
[HACKERS] Rewriting Free Space Map
I've started working on revamping the Free Space Map, using the approach
where we store a map of heap pages on every nth heap page.  What we need
now is discussion on the details of how exactly it should work.

Here's my rough plan on the implementation.  It's long, sorry.  I'm
fairly confident with it, but please let me know if you see any problems
or have any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page.  When
extending the relation, the heapam checks if the block number of the new
page is one that belongs to the FSM.  If it is, it lets the FSM
initialize it by calling InitFSMPage() on it, and extends the relation
again to get another, normal heap page.

I chose the "every nth page is an FSM page" approach, rather than using
a separate relfilenode, which I also considered.  The separate
relfilenode approach has some advantages, like being able to scan all
FSM pages in a sequential fashion, but it involves a fair amount of
catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room for
heap tuples.  It simplifies the locking, as we don't need to worry about
the possibility in the FSM that the caller is already holding a lock on
the same page.

In an FSM page, there's one byte for each of the next N heap pages,
starting from the FSM page.  That one byte stores the amount of free
space on the corresponding heap page, in BLCKSZ/256-byte precision (32
bytes with the default block size).  The mapping of free space to these
256 buckets wouldn't necessarily have to be a linear one; we could for
example have a single bucket for pages with more than BLCKSZ/2 bytes of
free space and divide the rest linearly into 16-byte buckets, but let's
keep it simple for now.  Of course, we could also just use 2 bytes per
page and store the page size exactly, but 32-byte precision seems like
enough to me.
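The one-byte-per-page encoding described above can be sketched directly; this is my own illustration of the linear-bucket scheme, not code from the patch:

```python
# Linear bucket encoding for per-page free space, as described in the
# proposal: one byte per heap page, BLCKSZ/256-byte precision.
BLCKSZ = 8192
BUCKET = BLCKSZ // 256   # 32 bytes per bucket with the default block size

def freespace_to_byte(avail: int) -> int:
    """Encode free bytes on a page into its one-byte FSM slot."""
    return min(avail // BUCKET, 255)

def byte_to_freespace(b: int) -> int:
    """Decode conservatively: bucket b guarantees at least b*BUCKET bytes."""
    return b * BUCKET

assert BUCKET == 32
assert freespace_to_byte(100) == 3        # 100 free bytes -> bucket 3
assert byte_to_freespace(3) == 96         # "at least 96 bytes free"
assert freespace_to_byte(BLCKSZ + 1) == 255   # clamped at the top bucket
```

The conservative decode matters: rounding down on encode means a page reported as having X bytes always really has at least X, so a tuple placement decision based on the map can't overflow the page.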
Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of
whether a page is used or not, we only need one bit per page.  If the
amount of free space on pages is interesting for an indexam in the
future, it can use the heap FSM implementation instead.  Or no FSM at
all, like the hash indexam.

To use the index FSM, the indexam needs to leave every nth page alone,
like in the heap.  The B-tree assumes that the metapage is at block 0,
but that's also where we would like to store the first index FSM page.
To overcome that, we can make the index FSM special area a little bit
smaller, so that the B-tree metadata fits on the same page as the FSM
information.  That will be wasted space on other FSM pages than block 0,
but we're only talking about 24 bytes per FSM page, and we only need one
FSM page per ~512 MB of index pages (with default BLCKSZ).

Performance
-----------

The patch I'm working on currently uses a naive way to find a page in
the FSM.  To find a page with X bytes of free space, it scans the FSM
pages until one is found.  And it always starts the scan from the
beginning, which is far from optimal.  And when there's no page with
enough free space, it still needs to scan all FSM pages just to find out
that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that.
First of all, we need to somehow cache the information that there's no
page with >= X bytes left, to avoid fruitless scanning.  To speed up the
case when there are only a few pages with enough free space, we can keep
a limited-size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages like
the FSM pages and managed by the buffer manager, or in a separate shmem
block.  I would like to go with normal bufmgr-managed pages, as
fixed-size memory blocks have their problems, and the cached information
should be stored to disk as well.
Let's have one special page in the heap file, called the Free Space List
(FSL) page, in addition to the normal FSM pages.  It has the following
structure:

	struct {
		bit anypages[256];

		struct {
			BlockNumber blockno;
			uint8 freespace;
		} freespacelist[as large as fits on the page];
	}

Remember that we track the free space on each page using one byte; IOW,
each page falls into one of 256 buckets of free space.  In the anypages
bitmap, we have one bit per bucket, indicating whether there are any
pages with that much free space.  When we look for a page with X bytes,
we check the bits up to the bucket corresponding to X bytes, and if
there are no set bits we know not to bother scanning the FSM pages.

To speed up the scan where there is space, we keep a simple list of
pages with free space.  This list is actually like the current FSM, but
here we only use it as a small cache of the FSM pages.  VACUUM and any
other operations that update the FSM can put pages into the list when
there are free slots.

We can store the FSL page in a magic location, say block #100.  For
relations
Re: [HACKERS] Single table forcing sequential scans on query plans
On Sun, 16 Mar 2008, Tom Lane wrote:

>> I have a weird query execution plan problem I am trying to debug on
>> Postgresql 8.2.6.  I have a query that joins against a temporary table
>> that has very few rows.
>
> Is it possible that the temp table ever has exactly zero rows?

Ah, that is indeed a possibility.  If I am to understand correctly,
there is no way to represent the difference between an un-analyzed table
and a zero-sized analyzed table as far as the query planner is
concerned?  Looks like I'll have to do a select count(*) before running
the query to avoid entering this trap.  (That feels a bit suboptimal,
since the conary repository code does extensive work with/through
temporary tables, and this could very well end up not being the only
section affected...)

> That's entirely the wrong way to think about it.  The planner is
> choosing a good plan based on its estimates of table sizes, which are
> wildly different in the two cases:
>
>    ->  Seq Scan on tmpinstanceid  (cost=0.00..1.02 rows=2 width=8) (actual time=0.005..0.007 rows=2 loops=1)
>
>    ->  Seq Scan on tmpinstanceid  (cost=0.00..29.40 rows=1940 width=8)

In this particular case it would be nice if there were a differentiation
between "estimated size 0" and "estimated size unknown".

> The only idea I have for how the planner could ignore a previous
> analyze result is if the analyze found the table to be of zero size.
> Then the heuristic would still be applied because relpages == 0.

For now I will try to run with the assumption that the massive
sequential scans are caused by joining an empty table in the query and
try to work my way around it - unless there is some trick to tell the
planner that this is a query that would be much better optimized away
instead of causing a massive IO storm.

Thanks,

Cristian
--
Cristian Gafton
rPath, Inc.
Re: [HACKERS] Rewriting Free Space Map
Heikki Linnakangas [EMAIL PROTECTED] writes:
> I've started working on revamping the Free Space Map, using the
> approach where we store a map of heap pages on every nth heap page.
> What we need now is discussion on the details of how exactly it should
> work.

You're cavalierly waving away a whole boatload of problems that will
arise as soon as you start trying to make the index AMs play along with
this :-(.  Hash for instance has very narrow-minded ideas about page
allocation within its indexes.

Also, I don't think that "use the special space" will scale to handle
other kinds of maps such as the proposed dead space map.  (This is
exactly why I said the other day that we need a design roadmap for all
these ideas.)

The idea that's becoming attractive to me while contemplating the
multiple-maps problem is that we should adopt something similar to the
old Mac OS idea of multiple "forks" in a relation.  In addition to the
main data fork which contains the same info as now, there could be one
or more map forks which are separate files in the filesystem.  They are
named by relfilenode plus an extension, for instance a relation with
relfilenode NNN would have a data fork in file NNN (plus perhaps NNN.1,
NNN.2, etc) and a map fork named something like NNN.map (plus NNN.map.1
etc as needed).

We'd have to add one more field to buffer lookup keys (BufferTag) to
disambiguate which fork the referenced page is in.  Having bitten that
bullet, though, the idea trivially scales to any number of map forks
with potentially different space requirements and different locking and
WAL-logging requirements.  Another possible advantage is that a new map
fork could be added to an existing table without much trouble.  Which is
certainly something we'd need if we ever hope to get update-in-place
working.

The main disadvantage I can see is that for very small tables, the
percentage overhead from multiple map forks of one page apiece is
annoyingly high.
However, most of the point of a map disappears if the table is small, so
we might finesse that by not creating any maps until the table has
reached some minimum size.

			regards, tom lane
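The fork-naming scheme and the widened buffer lookup key sketched in the message can be illustrated as follows. Everything here is hypothetical (the helper, its naming rules, the dictionary standing in for the buffer pool); it only restates the convention described above:

```python
# Illustration of the proposed multi-fork layout (not PostgreSQL code):
# relation NNN keeps its data fork in file NNN (segments NNN.1, NNN.2...)
# and a map fork in NNN.map (segments NNN.map.1...).

def fork_path(relfilenode: int, fork: str, segment: int = 0) -> str:
    base = str(relfilenode) if fork == "main" else f"{relfilenode}.{fork}"
    return base if segment == 0 else f"{base}.{segment}"

assert fork_path(16384, "main") == "16384"
assert fork_path(16384, "map") == "16384.map"
assert fork_path(16384, "map", 1) == "16384.map.1"

# The BufferTag change: keying the buffer pool on (relfilenode, fork,
# block) instead of (relfilenode, block), so block 0 of the data fork and
# block 0 of a map fork never collide:
buffers = {}
buffers[(16384, "main", 0)] = b"heap page"
buffers[(16384, "map", 0)] = b"map page"
assert len(buffers) == 2   # distinct entries despite the same block number
```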
Re: [HACKERS] Single table forcing sequential scans on query plans
Cristian Gafton [EMAIL PROTECTED] writes:
> On Sun, 16 Mar 2008, Tom Lane wrote:
>> Is it possible that the temp table ever has exactly zero rows?

> Ah, that is indeed a possibility.  If I am to understand correctly,
> there is no way to represent the difference between an un-analyzed
> table and a zero-sized analyzed table as far as the query planner is
> concerned?

While thinking about your report I was considering having VACUUM and
ANALYZE always set relpages to at least 1.  Then seeing relpages=0 would
indeed indicate a never-analyzed table, whereas relpages=1 when physical
table size is zero could be taken to indicate that we should trust the
table to be really empty.  I'm not sure though whether this sort of
convention would confuse any existing code.

Another possibility (though not a back-patchable solution) is that we
could just dispense with the heuristic size estimate and trust a
zero-sized table to stay zero-sized.  This would be relying on the
assumption that autovacuum will kick in and update the stats, leading to
invalidation of any existing plans that assume the table is small.  I
don't feel very comfortable about that though --- throwing a few hundred
tuples into a table might not be enough to draw autovacuum's attention,
but it could surely be enough to create a performance disaster for
nestloop plans.

Could you confirm that your problem cases are actually caused by this
effect and not something else?

			regards, tom lane
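The relpages convention being floated here reduces to a small decision rule; the sketch below illustrates it with an invented function and a placeholder heuristic constant (not PostgreSQL's actual default, which the message doesn't specify):

```python
# Sketch of the proposed "relpages >= 1 means trust ANALYZE" convention.
# HEURISTIC_PAGES is a placeholder, not the real planner constant.
HEURISTIC_PAGES = 10

def estimated_pages(relpages: int, physical_pages: int) -> int:
    if relpages == 0:
        # Never vacuumed/analyzed: apply the deliberately-not-small
        # heuristic so brittle nestloop plans aren't chosen blindly.
        return max(physical_pages, HEURISTIC_PAGES)
    # relpages >= 1: ANALYZE ran, so trust the recorded size -- even if
    # the table is genuinely empty.
    return physical_pages

assert estimated_pages(0, 0) == HEURISTIC_PAGES  # fresh temp table: inflated
assert estimated_pages(1, 0) == 0                # analyzed, really empty
assert estimated_pages(0, 50) == 50              # unanalyzed but clearly big
```

The key property is the middle case: relpages=1 with a physically empty file becomes a distinguishable "trust me, it's empty" signal, which is exactly what the current relpages=0 encoding cannot express.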
Re: [HACKERS] Rewriting Free Space Map
Tom Lane wrote:
> The idea that's becoming attractive to me while contemplating the
> multiple-maps problem is that we should adopt something similar to the
> old Mac OS idea of multiple "forks" in a relation.  In addition to the
> main data fork which contains the same info as now, there could be one
> or more map forks which are separate files in the filesystem.

I think something similar could be used to store tuple visibility bits
separately from heap tuple data itself, so +1 to this idea.

(The rough idea in my head was that you can do an indexscan and look up
visibility bits without having to pull the whole heap along; and
visibility updates are also cheaper, whether they come from indexscans
or heap scans.  Of course, the implicit cost is that a seqscan needs to
fetch the visibility pages, too; and the locking is more complex.)

--
Alvaro Herrera                        http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Commit fest?
Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> Gregory Stark wrote:
>>> Bruce Momjian [EMAIL PROTECTED] writes:
>>>> The comments are stored at JS-Kit: http://js-kit.com/comments/
>>> It's stored in their server?
>> Yes, that was the beauty of it --- I just add javascript with a tag
>> and all comments are handled by them.
>
> Beauty?  I don't think we want to rely on non-project-controlled
> servers for anything that's part of our core infrastructure.  If/when
> js-kit.com goes belly-up, what happens to that data?  Also, what kind
> of privacy guarantees have we got?  (Admittedly, privacy may be moot
> for information that was originally entered on a public web page, but
> the whole idea of someone else controlling our data just makes me
> itch.)
>
> I can go along with this as a jury-rig setup for our first commit fest,
> but it just seems like another powerful argument for moving to
> something wiki-based as soon as we can get that sorted.

We don't need these comments for more than a few weeks --- I don't see a
problem with it.

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Commit fest?
Gregory Stark wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> You want the message-id on the listing page?  Sure, I was doing that
>> before but I didn't know anyone wanted it and it looked a little
>> cluttered.  Let me know.
>
> I agree it was too cluttered for normal use.  What I'm trying to do is
> get a page which has the message-id's of all the messages and the
> comments on the same page.  That way I can dump the data into a text
> file to experiment with.

Why not grab the mbox file and grep out the message ids?

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Commit fest?
Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> Gregory Stark wrote:
>>> Bruce Momjian [EMAIL PROTECTED] writes:
>>>> The comments are stored at JS-Kit: http://js-kit.com/comments/
>>> It's stored in their server?
>> Yes, that was the beauty of it --- I just add javascript with a tag
>> and all comments are handled by them.
>
> Beauty?  I don't think we want to rely on non-project-controlled
> servers for anything that's part of our core infrastructure.  If/when
> js-kit.com goes belly-up, what happens to that data?  Also, what kind
> of privacy guarantees have we got?  (Admittedly, privacy may be moot
> for information that was originally entered on a public web page, but
> the whole idea of someone else controlling our data just makes me
> itch.)
>
> I can go along with this as a jury-rig setup for our first commit fest,
> but it just seems like another powerful argument for moving to
> something wiki-based as soon as we can get that sorted.

We could move to a wiki if someone finds out how to dump emails into a
wiki, or if we decide to be more structured in our methods, like having
separate URLs for bugs, feature requests, and patches, and doing all
activity on those items at those URLs.

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Commit fest?
Gregory Stark wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> You want the message-id on the listing page?  Sure, I was doing that
>> before but I didn't know anyone wanted it and it looked a little
>> cluttered.  Let me know.
>
> I agree it was too cluttered for normal use.  What I'm trying to do is
> get a page which has the message-id's of all the messages and the
> comments on the same page.  That way I can dump the data into a text
> file to experiment with.

Oh, what I would really like is to be able to pull up
archives.postgresql.org emails based on message id so I can link to the
entire thread.  Unfortunately, it doesn't work there, nor does Google or
any of the other Postgres email archive sites.

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Commit fest?
Bruce Momjian [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> I can go along with this as a jury-rig setup for our first commit
>> fest, but it just seems like another powerful argument for moving to
>> something wiki-based as soon as we can get that sorted.

> We could move to a wiki if someone finds out how to dump emails into a
> wiki, or if we decide to be more structured in our methods, like having
> separate URLs for bugs, feature requests, and patches, and doing all
> activity on those items at those URLs.

I don't think we want to dump emails into a wiki.  What we want is a
page with links into the mail archives plus commentary.

			regards, tom lane
Re: [HACKERS] Commit fest?
Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> I can go along with this as a jury-rig setup for our first commit
>>> fest, but it just seems like another powerful argument for moving to
>>> something wiki-based as soon as we can get that sorted.
>> We could move to a wiki if someone finds out how to dump emails into a
>> wiki, or if we decide to be more structured in our methods, like
>> having separate URLs for bugs, feature requests, and patches, and
>> doing all activity on those items at those URLs.
>
> I don't think we want to dump emails into a wiki.  What we want is a
> page with links into the mail archives plus commentary.

So the thread titles get put into a wiki that supports comments, with
URL links to our archives?

--
  Bruce Momjian  [EMAIL PROTECTED]        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] New style of hash join proposal
Gregory Stark [EMAIL PROTECTED] writes:
> We currently execute a lot of joins as Nested Loops which would be more
> efficient if we could batch together all the outer keys and execute a
> single inner bitmap index scan for all of them together.

Please give an example of what you're talking about that you think we
can't do now.

			regards, tom lane
Re: [HACKERS] Single table forcing sequential scans on query plans
On Sun, 16 Mar 2008, Tom Lane wrote:

>> Ah, that is indeed a possibility.  If I am to understand correctly,
>> there is no way to represent the difference between an un-analyzed
>> table and a zero-sized analyzed table as far as the query planner is
>> concerned?
>
> While thinking about your report I was considering having VACUUM and
> ANALYZE always set relpages to at least 1.  Then seeing relpages=0
> would indeed indicate a never-analyzed table, whereas relpages=1 when
> physical table size is zero could be taken to indicate that we should
> trust the table to be really empty.  I'm not sure though whether this
> sort of convention would confuse any existing code.

If having a discrepancy between relpages and table size is a concern,
could relpages be a negative value to mark a non-analyzed table?

> Another possibility (though not a back-patchable solution) is that we
> could just dispense with the heuristic size estimate and trust a
> zero-sized table to stay zero-sized.  This would be relying on the

I think improving the estimator would get us further, since in most
cases it seems to get it relatively right.

> Could you confirm that your problem cases are actually caused by this
> effect and not something else?

Yes, confirmed.  The runaway queries all are joining against an empty
temporary table.

Thanks,

Cristian
--
Cristian Gafton
rPath, Inc.
Re: [HACKERS] Commit fest?
On Sun, 16 Mar 2008, Bruce Momjian wrote:
> Oh, what I would really like is to be able to pull up
> archives.postgresql.org emails based on message id so I can link to the
> entire thread. Unfortunately, it doesn't work there, nor does Google or
> any of the other Postgres email archive sites.

This is something I've been looking into for my own organization. The message ids are at the start of the archive web pages. For example, your e-mail here that I'm replying to begins like this if you look at the page source of http://archives.postgresql.org/pgsql-hackers/2008-03/msg00554.php

<!-- MHonArc v2.6.16 -->
<!--X-Subject: Re: Commit fest? -->
<!--X-From-R13: Pehpr [bzwvna oehprNzbzwvna.hf -->
<!--X-Date: Sun, 16 Mar 2008 23:19:33 -0300 (ADT) -->
<!--X-Message-Id: [EMAIL PROTECTED] -->
<!--X-Content-Type: text/plain -->
<!--X-Reference: [EMAIL PROTECTED] -->
<!--X-Head-End-->

I was thinking of writing something that scraped the archives, building a lookup table out of this information. What would be nice is if the X-Message-Id and X-Reference were both put into the regular HTML for future archived messages, so that it's more likely tools like Google could search based on them. A brief glance at the MHonArc documentation suggests it could even be run to re-convert any existing messages that are still available, in order to add those as well.

--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
[HACKERS] Remove hacks for old bad qsort() implementations?
There are several places in tuplesort.c (and perhaps elsewhere) where we explicitly work around limitations of various platforms' qsort() functions. Notably, there's this bit in comparetup_index_btree:

	/*
	 * If key values are equal, we sort on ItemPointer.  This does not affect
	 * validity of the finished index, but it offers cheap insurance against
	 * performance problems with bad qsort implementations that have trouble
	 * with large numbers of equal keys.
	 */

which I unquestioningly copied into comparetup_index_hash yesterday. However, oprofile is telling me that doing this is costing *significantly* more than just returning zero would do:

   9081  0.3050 :    tuple1 = (IndexTuple) a->tuple;
   3759  0.1263 :    tuple2 = (IndexTuple) b->tuple;
                :
                :    {
 130409  4.3800 :        BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
  34539  1.1601 :        BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
                :
   3281  0.1102 :        if (blk1 != blk2)
    812  0.0273 :            return (blk1 < blk2) ? -1 : 1;
                :    }
                :    {
     28 9.4e-04 :        OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
      1 3.4e-05 :        OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
                :
      1 3.4e-05 :        if (pos1 != pos2)
  48757  1.6376 :            return (pos1 < pos2) ? -1 : 1;
                :    }
                :
                :    return 0;
  56705  1.9045 :}

Looks to me like we're eating more than seven percent of the total runtime to do this :-(

Now, as far as I can see, the original motivation for this (as stated in the comment) is entirely dead, since we always use our own qsort implementation in preference to whatever bogus version a given libc might supply. What do people think of removing this bit of code in favor of just returning 0? I can see a couple of possible objections:

1. Someday we might go back to using the platform qsort. (But surely we could insist on qsort behaving sanely for equal keys.)

2. If you've got lots of equal keys, it's conceivable that having the index entries sorted by TID offers some advantage in indexscan speed.
I'm dubious that that's useful, mainly because the planner should prefer a bitmap scan in such a case; and anyway the ordering is unlikely to be preserved for long. But it's something to think about. Comments?

			regards, tom lane
[HACKERS] [4/4] Proposal of SE-PostgreSQL patches
[4/4] - sepostgresql-policy-8.4devel-3.patch

This patch gives us the default security policy for SE-PostgreSQL. You can build it as a security policy module. It can be linked with the existing distributor's policy, and reloaded.

--
OSS Platform Development Division, NEC
KaiGai Kohei [EMAIL PROTECTED]

diff -rpNU3 pgace/contrib/sepgsql-policy/Makefile sepgsql/contrib/sepgsql-policy/Makefile
--- pgace/contrib/sepgsql-policy/Makefile	1970-01-01 09:00:00.0 +0900
+++ sepgsql/contrib/sepgsql-policy/Makefile	2008-03-12 20:00:04.0 +0900
@@ -0,0 +1,20 @@
+# SE-PostgreSQL Security Policy
+#--
+
+SHAREDIR := /usr/share/selinux
+
+AWK ?= gawk
+NAME ?= $(strip $(shell $(AWK) -F= '/^SELINUXTYPE/{ print $$2 }' /etc/selinux/config))
+
+SELINUX_POLICY := /usr/share/selinux
+
+all: sepostgresql.pp
+
+install: all
+	install -m 0644 sepostgresql.pp $(SELINUX_POLICY)/$(NAME)
+
+sepostgresql.pp: sepostgresql.te sepostgresql.if sepostgresql.fc
+	make -f $(SELINUX_POLICY)/devel/Makefile NAME=$(NAME)
+
+clean:
+	make -f $(SELINUX_POLICY)/devel/Makefile NAME=$(NAME) clean
diff -rpNU3 pgace/contrib/sepgsql-policy/sepostgresql.fc sepgsql/contrib/sepgsql-policy/sepostgresql.fc
--- pgace/contrib/sepgsql-policy/sepostgresql.fc	1970-01-01 09:00:00.0 +0900
+++ sepgsql/contrib/sepgsql-policy/sepostgresql.fc	2008-03-13 10:21:48.0 +0900
@@ -0,0 +1,17 @@
+#
+# SE-PostgreSQL install path
+#
+/usr/bin/sepostgres		--	gen_context(system_u:object_r:postgresql_exec_t,s0)
+/usr/bin/initdb.sepgsql		--	gen_context(system_u:object_r:postgresql_exec_t,s0)
+/usr/bin/sepg_ctl		--	gen_context(system_u:object_r:initrc_exec_t,s0)
+
+/var/lib/sepgsql(/.*)?			gen_context(system_u:object_r:postgresql_db_t,s0)
+/var/lib/sepgsql/pgstartup\.log		gen_context(system_u:object_r:postgresql_log_t,s0)
+/var/log/sepostgresql\.log.*	--	gen_context(system_u:object_r:postgresql_log_t,s0)
+
+#
+# For source installation
+#
+/usr/local/pgsql/bin/postgres	--	gen_context(system_u:object_r:postgresql_exec_t,s0)
+/usr/local/pgsql/bin/initdb	--	gen_context(system_u:object_r:postgresql_exec_t,s0)
+/usr/local/pgsql/bin/pg_ctl	--	gen_context(system_u:object_r:initrc_exec_t,s0)
diff -rpNU3 pgace/contrib/sepgsql-policy/sepostgresql.if sepgsql/contrib/sepgsql-policy/sepostgresql.if
--- pgace/contrib/sepgsql-policy/sepostgresql.if	1970-01-01 09:00:00.0 +0900
+++ sepgsql/contrib/sepgsql-policy/sepostgresql.if	2008-03-12 20:00:04.0 +0900
@@ -0,0 +1,88 @@
+
+## <summary>
+##	Marks the specified domain as SE-PostgreSQL server process.
+## </summary>
+## <param name="domain">
+##	<summary>
+##	Domain to be marked
+##	</summary>
+## </param>
+#
+interface(`sepgsql_server_domain',`
+	gen_require(`
+		attribute sepgsql_server_type;
+	')
+	typeattribute $1 sepgsql_server_type;
+')
+
+## <summary>
+##	Allow the specified domain unconfined accesses to any database objects
+##	managed by SE-PostgreSQL,
+## </summary>
+## <param name="domain">
+##	<summary>
+##	Domain allowed access.
+##	</summary>
+## </param>
+#
+interface(`sepgsql_unconfined_domain',`
+	gen_require(`
+		attribute sepgsql_unconfined_type;
+		attribute sepgsql_client_type;
+	')
+	typeattribute $1 sepgsql_unconfined_type;
+	typeattribute $1 sepgsql_client_type;
+')
+
+## <summary>
+##	Allow the specified domain unprivileged accesses to any database objects
+##	managed by SE-PostgreSQL,
+## </summary>
+## <param name="domain">
+##	<summary>
+##	Domain allowed access.
+##	</summary>
+## </param>
+#
+interface(`sepgsql_client_domain',`
+	gen_require(`
+		attribute sepgsql_client_type;
+	')
+	typeattribute $1 sepgsql_client_type;
+')
+
+## <summary>
+##	Allow the specified role to invoke trusted procedures
+## </summary>
+## <param name="role">
+##	<summary>
+##	The role associated with the domain.
+##	</summary>
+## </param>
+#
+interface(`sepgsql_trusted_procedure_role',`
+	gen_require(`
+		type sepgsql_trusted_domain_t;
+	')
+	role $1 types sepgsql_trusted_domain_t;
+')
+
+## <summary>
+##	Marks as a SE-PostgreSQL loadable shared library module
+## </summary>
+## <param name="type">
+##	<summary>
+##	Type marked as a database object type.
+##	</summary>
+## </param>
+#
+interface(`sepgsql_loadable_module',`
+	gen_require(`
+		attribute sepgsql_module_type;
+	')
+	typeattribute $1 sepgsql_module_type;
+')
diff -rpNU3 pgace/contrib/sepgsql-policy/sepostgresql.te sepgsql/contrib/sepgsql-policy/sepostgresql.te
--- pgace/contrib/sepgsql-policy/sepostgresql.te	1970-01-01 09:00:00.0 +0900
+++ sepgsql/contrib/sepgsql-policy/sepostgresql.te	2008-03-12 20:00:04.0 +0900
@@ -0,0 +1,353 @@
+policy_module(sepostgresql, 3.01)
+
+gen_require(`
+	class db_database all_db_database_perms;
+	class db_table all_db_table_perms;
+
[HACKERS] [3/4] Proposal of SE-PostgreSQL patches
[3/4] - sepostgresql-pg_dump-8.4devel-3.patch

This patch gives us a feature to dump a database with its security attributes. It is turned on with the '--enable-selinux' option of pg_dump/pg_dumpall, when the server runs as the SE-PostgreSQL version. Needless to say, users need to have enough capabilities to dump the whole database. The same applies when they try to restore the database.

--
OSS Platform Development Division, NEC
KaiGai Kohei [EMAIL PROTECTED]

diff -rpNU3 pgace/src/bin/pg_dump/pg_dump.c sepgsql/src/bin/pg_dump/pg_dump.c
--- pgace/src/bin/pg_dump/pg_dump.c	2008-02-03 01:18:48.0 +0900
+++ sepgsql/src/bin/pg_dump/pg_dump.c	2008-02-03 01:26:35.0 +0900
@@ -118,6 +118,9 @@ static int	g_numNamespaces;
 /* flag to turn on/off dollar quoting */
 static int	disable_dollar_quoting = 0;
+/* flag to turn on/off SE-PostgreSQL support */
+#define SELINUX_SYSATTR_NAME	"security_context"
+static int	enable_selinux = 0;
 
 static void help(const char *progname);
 static void expand_schema_name_patterns(SimpleStringList *patterns,
@@ -267,6 +270,7 @@ main(int argc, char **argv)
 		{"disable-dollar-quoting", no_argument, &disable_dollar_quoting, 1},
 		{"disable-triggers", no_argument, &disable_triggers, 1},
 		{"use-set-session-authorization", no_argument, &use_setsessauth, 1},
+		{"enable-selinux", no_argument, &enable_selinux, 1},
 		{NULL, 0, NULL, 0}
 	};
@@ -419,6 +423,8 @@ main(int argc, char **argv)
 				disable_triggers = 1;
 			else if (strcmp(optarg, "use-set-session-authorization") == 0)
 				use_setsessauth = 1;
+			else if (strcmp(optarg, "enable-selinux") == 0)
+				enable_selinux = 1;
 			else
 			{
 				fprintf(stderr,
@@ -549,6 +555,24 @@ main(int argc, char **argv)
 	std_strings = PQparameterStatus(g_conn, "standard_conforming_strings");
 	g_fout->std_strings = (std_strings && strcmp(std_strings, "on") == 0);
 
+	if (enable_selinux) {
+		/* confirm whether the server supports SELinux features */
+		const char *tmp = PQparameterStatus(g_conn, "security_sysattr_name");
+
+		if (!tmp) {
+			write_msg(NULL, "could not get security_sysattr_name from libpq\n");
+			exit(1);
+		}
+		if (strcmp(SELINUX_SYSATTR_NAME, tmp) != 0) {
+			write_msg(NULL, "server does not have SELinux feature\n");
+			exit(1);
+		}
+		if (g_fout->remoteVersion < 80204) {
+			write_msg(NULL, "server version is too old (%u)\n", g_fout->remoteVersion);
+			exit(1);
+		}
+	}
+
 	/* Set the datestyle to ISO to ensure the dump's portability */
 	do_sql_command(g_conn, "SET DATESTYLE = ISO");
@@ -771,6 +795,7 @@ help(const char *progname)
 	printf(_("  --use-set-session-authorization\n"
 			 "                           use SESSION AUTHORIZATION commands instead of\n"
 			 "                           ALTER OWNER commands to set ownership\n"));
+	printf(_("  --enable-selinux         enable to dump security context in SE-PostgreSQL\n"));
 
 	printf(_("\nConnection options:\n"));
 	printf(_("  -h, --host=HOSTNAME      database server host or socket directory\n"));
@@ -1160,7 +1185,8 @@ dumpTableData_insert(Archive *fout, void
 	if (fout->remoteVersion >= 70100)
 	{
 		appendPQExpBuffer(q, "DECLARE _pg_dump_cursor CURSOR FOR "
-						  "SELECT * FROM ONLY %s",
+						  "SELECT * %s FROM ONLY %s",
+						  (!enable_selinux ? "" : "," SELINUX_SYSATTR_NAME),
 						  fmtQualifiedId(tbinfo->dobj.namespace->dobj.name,
 										 classname));
 	}
@@ -1774,11 +1800,32 @@ dumpBlobComments(Archive *AH, void *arg)
 		Oid			blobOid;
 		char	   *comment;
 
+		blobOid = atooid(PQgetvalue(res, i, 0));
+
+		/* dump security context of binary large object */
+		if (enable_selinux) {
+			PGresult *__res;
+			char query[512];
+
+			snprintf(query, sizeof(query),
+					 "SELECT lo_get_security(%u)", blobOid);
+			__res = PQexec(g_conn, query);
+			check_sql_result(__res, g_conn, query, PGRES_TUPLES_OK);
+
+			if (PQntuples(__res) != 1) {
+				write_msg(NULL, "lo_get_security(%u) returns %d tuples\n",
+						  blobOid, PQntuples(__res));
+				exit_nicely();
+			}
+			archprintf(AH, "SELECT lo_set_security(%u, '%s');\n",
+					   blobOid, PQgetvalue(__res, 0, 0));
+			PQclear(__res);
+		}
+
 		/* ignore blobs without comments */
 		if (PQgetisnull(res, i, 1))
 			continue;
 
-		blobOid = atooid(PQgetvalue(res, i, 0));
 		comment = PQgetvalue(res, i, 1);
 
 		printfPQExpBuffer(commentcmd, "COMMENT ON LARGE OBJECT %u IS ",
@@ -2886,6 +2933,7 @@ getTables(int *numTables)
 	int			i_owning_col;
 	int			i_reltablespace;
 	int			i_reloptions;
+	int			i_selinux;
 
 	/* Make sure we are in proper schema */
 	selectSourceSchema("pg_catalog");
@@ -2926,6 +2974,7 @@ getTables(int *numTables)
 						  "d.refobjsubid as owning_col, "
 						  "(SELECT spcname FROM pg_tablespace t WHERE t.oid = c.reltablespace) AS reltablespace, "
 						  "array_to_string(c.reloptions, ', ') as reloptions "
+						  "%s"
 						  "from pg_class c "
 						  "left join pg_depend d on "
 						  "(c.relkind = '%c' and "
@@ -2935,6 +2984,7 @@ getTables(int *numTables)
 						  "where relkind in ('%c', '%c', '%c', '%c') "
 						  "order by
Re: [HACKERS] Single table forcing sequential scans on query plans
Cristian Gafton [EMAIL PROTECTED] writes:
> On Sun, 16 Mar 2008, Tom Lane wrote:
>> While thinking about your report I was considering having VACUUM and
>> ANALYZE always set relpages to at least 1. Then seeing relpages=0 would
>> indeed indicate a never-analyzed table, whereas relpages=1 when physical
>> table size is zero could be taken to indicate that we should trust the
>> table to be really empty. I'm not sure though whether this sort of
>> convention would confuse any existing code.
>
> If having a discrepancy between relpages and table size is a concern,
> could relpages be a negative value to mark a non-analyzed table?

No, the value is really a uint32, though we don't declare it that way for lack of having any such SQL type :-(. (uint32) -1 is just as legal a value as 1, though perhaps a lot less likely. Anyway, client code looking at the column is probably more likely to get confused by a negative value for relpages than by a value that doesn't match underlying reality (which it can't easily see anyway).

>> Could you confirm that your problem cases are actually caused by this
>> effect and not something else?
>
> Yes, confirmed. The runaway queries all are joining against an empty
> temporary table.

Good, just wanted to be sure. If there are no objections, I'll put in the at-least-1 hack.

			regards, tom lane
Re: [HACKERS] [0/4] Proposal of SE-PostgreSQL patches
It seems that some of the SE-PostgreSQL patches have not been delivered yet, although [3/4] and [4/4] already arrived. Did the anti-spam system catch my previous three messages? If necessary, I will send them again.

Thanks,

Kohei KaiGai wrote:
> The series of patches are the proposal of Security-Enhanced PostgreSQL
> (SE-PostgreSQL) for the upstream PostgreSQL 8.4 development cycle.
>
> [1/4] sepostgresql-pgace-8.4devel-3.patch
>       provides the PGACE (PostgreSQL Access Control Extension) framework
> [2/4] sepostgresql-sepgsql-8.4devel-3.patch
>       provides the SE-PostgreSQL feature, based on the PGACE framework.
> [3/4] sepostgresql-pg_dump-8.4devel-3.patch
>       enables pg_dump to dump a database with security attributes.
> [4/4] sepostgresql-policy-8.4devel-3.patch
>       provides the default security policy for SE-PostgreSQL.
- snip -

--
OSS Platform Development Division, NEC
KaiGai Kohei [EMAIL PROTECTED]
[HACKERS] [0/4] Proposal of SE-PostgreSQL patches
This series of patches is the proposal of Security-Enhanced PostgreSQL (SE-PostgreSQL) for the upstream PostgreSQL 8.4 development cycle.

[1/4] sepostgresql-pgace-8.4devel-3.patch
      provides the PGACE (PostgreSQL Access Control Extension) framework
[2/4] sepostgresql-sepgsql-8.4devel-3.patch
      provides the SE-PostgreSQL feature, based on the PGACE framework.
[3/4] sepostgresql-pg_dump-8.4devel-3.patch
      enables pg_dump to dump a database with security attributes.
[4/4] sepostgresql-policy-8.4devel-3.patch
      provides the default security policy for SE-PostgreSQL.

We can provide a quick overview of SE-PostgreSQL at:
http://code.google.com/p/sepgsql/wiki/WhatIsSEPostgreSQL

Any comments and suggestions are welcome.

Thanks,

ENVIRONMENT
---
Please confirm your environment. The following are requirements of SE-PostgreSQL:

* Fedora 8 or later system
* SELinux is enabled and working
* kernel-2.6.24 or later
* selinux-policy and selinux-policy-devel v3.0.8 or later
* libselinux, policycoreutils

INSTALLATION
---
$ tar jxvf postgresql-snapshot.tar.bz2
$ cd postgresql-snapshot
$ patch -p1 < ../sepostgresql-pgace-8.4devel-3.patch
$ patch -p1 < ../sepostgresql-sepgsql-8.4devel-3.patch
$ patch -p1 < ../sepostgresql-pg_dump-8.4devel-3.patch
$ patch -p1 < ../sepostgresql-policy-8.4devel-3.patch
$ ./configure --enable-selinux
$ make
$ make -C contrib/sepgsql-policy
$ su
# make install
# /usr/sbin/semodule -i contrib/sepgsql-policy/sepostgresql.pp
  (NOTE: semodule is a utility to load/unload security policy modules.)
# /sbin/restorecon -R /usr/local/pgsql
  (NOTE: restorecon is a utility to initialize the security context of files.)

SETUP
---
# mkdir -p /opt/sepgsql
# chown foo_user:var_group /opt/sepgsql
# chcon -t postgresql_db_t /opt/sepgsql
  (NOTE: chcon is a utility to set up the security context of files.)
# exit
$ /usr/sbin/run_init /usr/local/pgsql/bin/initdb -D /opt/sepgsql
  (NOTE: run_init is a utility to start a program as if it were branched from the init script.)
$ /usr/local/pgsql/bin/pg_ctl -D /opt/sepgsql start

--
OSS Platform Development Division, NEC
KaiGai Kohei [EMAIL PROTECTED]